Questions tagged [missing-data]

For questions relating to missing data problems, which can involve special data structures, algorithms, statistical methods, modeling techniques, and visualization, among other considerations.

When working with data in regular data structures (e.g. tables, matrices, arrays, tensors), some values may be unobserved, corrupted, or not yet available. Treating such data requires additional annotation to mark the missing entries, as well as methodological care when deciding how to impute or otherwise use the data in standard contexts. This is a particular problem in data-intensive contexts, such as large statistical analyses of databases.
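As a minimal sketch of what such annotation can look like, the snippet below marks unobserved entries of a small, made-up table with NaN and derives a boolean mask from it. The table values and variable names are illustrative assumptions, and NaN is only one of several possible encodings.

```python
import numpy as np

# Hypothetical 3x3 table; np.nan marks entries that were not observed.
table = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
])

# The annotation: a boolean mask that is True where a value is missing.
missing_mask = np.isnan(table)

# Missing entries per column, and number of rows with at least one gap.
missing_per_column = missing_mask.sum(axis=0)
rows_with_missing = int(missing_mask.any(axis=1).sum())

# NaN-aware column means skip missing entries instead of propagating NaN.
column_means = np.nanmean(table, axis=0)

print(missing_per_column, rows_with_missing, column_means)
```

Keeping the mask separate from the values makes it possible to impute later without losing track of which entries were originally missing.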

Missing data occur in many fields, from survey data to industrial data. There are many underlying missing data mechanisms (reasons why the data are missing). In survey data, for example, values might be missing because respondents drop out or run out of time.

Rubin classified missing data into three types:

  1. missing completely at random;
  2. missing at random;
  3. missing not at random.

Note that some statistical analyses are valid only under a particular class.
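The three classes can be illustrated with a small simulation. This is only a sketch under assumed settings: the variables `x` and `y`, the logistic missingness probabilities, and the 30% rate are all hypothetical choices, not part of Rubin's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: x is always observed, y may go missing.
x = rng.normal(size=n)
y = x + rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every y has the same 30% chance of being missing.
mcar = rng.random(n) < 0.3
# MAR: the chance of missingness depends only on the observed x.
mar = rng.random(n) < sigmoid(2 * x - 1)
# MNAR: the chance of missingness depends on the unobserved y itself.
mnar = rng.random(n) < sigmoid(2 * y - 1)

# Complete-case means: average y over the rows where y is observed.
true_mean = y.mean()
mean_mcar = y[~mcar].mean()
mean_mar = y[~mar].mean()
mean_mnar = y[~mnar].mean()

print(true_mean, mean_mcar, mean_mar, mean_mnar)
```

In this setup the complete-case mean under MCAR stays close to the full-data mean, while under MAR and MNAR it is pulled downward because larger values are more likely to be missing. Under MAR the bias can in principle be corrected using the observed `x`; under MNAR it cannot be corrected from the observed data alone.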

2809 questions
47 votes
4 answers

Dealing with missing values for correlations calculation

I have a huge matrix with a lot of missing values. I want to get the correlation between variables. 1. Is the solution cor(na.omit(matrix)) better than the one below? cor(matrix, use = "pairwise.complete.obs") I have already selected only variables…
Delphine
44 votes
9 answers

Best way to count the number of rows with missing values in a pandas DataFrame

I currently came up with some work arounds to count the number of missing values in a pandas DataFrame. Those are quite ugly and I am wondering if there is a better way to do it. Let's create an example DataFrame: from numpy.random import randn df =…
user2489252
41 votes
3 answers

How to get Python to gracefully format None and non-existing fields

If I write in Python: data = {'n': 3, 'k': 3.141594, 'p': {'a': 7, 'b': 8}} print('{n}, {k:.2f}, {p[a]}, {p[b]}'.format(**data)) del data['k'] data['p']['b'] = None print('{n}, {k:.2f}, {p[a]}, {p[b]}'.format(**data)) I get: 3, 3.14, 7, 8 Traceback…
Juan A. Navarro
39 votes
8 answers

Pandas: print column name with missing values

I am trying to print or to get list of columns name with missing values. E.g. data1 data2 data3 1 3 3 2 NaN 5 3 4 NaN I want to get ['data2', 'data3']. I wrote following code: print('\n'.join(map( lambda x :…
LinearLeopard
37 votes
1 answer

Multivariate LSTM with missing values

I am working on a Time Series Forecasting problem using LSTM. The input contains several features, so I am using a Multivariate LSTM. The problem is that there are some missing values, for example: Feature 1 Feature 2 ... Feature n 1 …
Marco
36 votes
7 answers

Missing values in scikits machine learning

Is it possible to have missing values in scikit-learn ? How should they be represented? I couldn't find any documentation about that.
Vladtn
32 votes
3 answers

Randomly insert NA's values in a pandas dataframe

How can I randomly insert np.nan's in a DataFrame ? Let's say I want 10% null values inside my DataFrame. My data looks like this : df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'b', 'c', 'd', 'e'], …
mitsi
32 votes
5 answers

Pandas Dataframe: Replacing NaN with row average

I am trying to learn pandas but I have been puzzled with the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work but for some reason it fails for me. Am I missing…
Aenaon
31 votes
4 answers

Error in na.fail.default: missing values in object - but no missing values

I am trying to run a lme model with these…
InverniE
31 votes
2 answers

python scikit-learn clustering with missing data

I want to cluster data with missing columns. Doing it manually I would calculate the distance in case of a missing column simply without this column. With scikit-learn, missing data is not possible. There is also no chance to specify a user distance…
Michael Hecht
31 votes
6 answers

Replacing NAs in R with nearest value

I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data: dat <- c(1, 3, NA, NA, 5, 7) Replacing NA with na.locf (3 is…
geoffjentry
30 votes
3 answers

Fill in missing pandas data with previous non-missing value, grouped by key

I am dealing with pandas DataFrames like this: id x 0 1 10 1 1 20 2 2 100 3 2 200 4 1 NaN 5 2 NaN 6 1 300 7 1 NaN I would like to replace each NAN 'x' with the previous non-NAN 'x' from a row with the same 'id'…
ChrisB
29 votes
8 answers

Leaving values blank if not passed in str.format

I've run into a fairly simple issue that I can't come up with an elegant solution for. I'm creating a string using str.format in a function that is passed in a dict of substitutions to use for the format. I want to create the string and format it…
marky1991
28 votes
3 answers

Replace NA in column with value in adjacent column

This question is related to a post with a similar title (replace NA in an R vector with adjacent values). I would like to scan a column in a data frame and replace NA's with the value in the adjacent cell. In the aforementioned post, the solution…
hubert_farnsworth
27 votes
3 answers

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well, for instance: Many R functions have an na.rm flag that when set to TRUE, remove the NAs: >>> v = mean( c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67),…
doug