0

I have a large dataset of a survey (originally a Stata(.dta) file). I would like to use the code below to convert negative values in my dataset to NA. If a variable has more than 99% NA's the code should drop it.

#mixed data
WVS <- data.frame(file)
dat <- WVS[,sapply(df, function(x) {class(x)== "numeric" | class(x) == "integer"})]

# NEGATIVES -> NA
foo <- function(dat, p){ 
  ind <- colSums(is.na(dat))/nrow(dat)
  dat[dat < 0] <- NA
  dat[, ind < p]
}
# process numeric part of the data separately
ii <- sapply(WVS, class) == "numeric"
WVS.num <- foo(as.matrix(WVS[, ii]), 0.99)
# then stick the two parts back together again
WVS <- data.frame(WVS[, !ii], WVS.num)

This did however not work as it appears that:

> is("S004")
[1] "character"           "vector"              "data.frameRowLabels" "SuperClassMethod"    "index"              
[6] "atomicVector

Str(WVS):

$ S004     :Class 'labelled'  atomic [1:50] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
  .. ..- attr(*, "label")= chr "Set"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
  .. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...

How do I adapt my code to cope with this?

UPDATE:

I have altered the answer below and tried to make it function with a loop (because my dataset is too big for the solution below.

# Creating a column with the same length as the original dataset 
WVSc <- data.frame(x = 1:341271, y = c(NA))

# Loop for every column
for(i in 1:ncol(WVS))
# Replace all negatives with NA if possible
{try(WVS[,i] <- NA^(WVS[,i]<0) * WVS[,i])
# Select columns to keep and create a new dataframe from these columns
col_to_keep <-  sapply(WVSx[,i], function(x) sum(is.na(x)/length(x))
col_to_keep <- names(col_to_keep[col_to_keep <= 0.99])
WVSc < - cbind(WVS,col_to_keep)}

So, the above does not really work. In addition I was hoping to, by looping, remove columns which have more than 99% NA rather than create a new df, and bind the ones that have less.

Tom
  • 2,173
  • 1
  • 17
  • 44
  • 3
    Please provide some example data to work with, along with exact expected output. – andrew_reece Jul 17 '18 at 08:59
  • I am not sure how to.. the data is really weird. I cannot even use `head(WVS)` to provide example data. – Tom Jul 17 '18 at 09:17
  • Create your own dataset then which is similar enough. – s_baldur Jul 17 '18 at 09:20
  • That is nice, but for this dataset I would not even know how. – Tom Jul 17 '18 at 09:22
  • 1
    You don't need to test separately for integer and numeric. Class "numeric" includes types double and integer. Also, use `is.numeric`. How does `is("F116")` relate to your code? – Roland Jul 17 '18 at 10:22
  • Thank you @Roland. I have removed the `| sapply(ES1, class) == "integer"` part. `F116` is one of the variables in the `dataframe` for I would like to apply the code. I was trying to test what class it had, because it apparently was not numeric. – Tom Jul 17 '18 at 10:27
  • What does `str(WVS)` tell you? – Roland Jul 17 '18 at 10:35
  • Added it to the original post. It is orginally a .dta (stata) file. – Tom Jul 17 '18 at 10:37
  • That is not the complete output. You show one list element (most likely a data.frame column). I've never seen a vector of mode "atomic" before. How did you import the file? – Roland Jul 17 '18 at 11:09
  • It's a data set from the World Value Surveys [link] (http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp). I downloaded it as a `.dta` file to make use of the variable labels. The `.rdata`, is however just as messed up. – Tom Jul 17 '18 at 11:15

1 Answers1

1

Since you haven't provided any example, here's my bulls eye solution. Hopefully this would give you some headstart:

cleanFun <- function(df){

    # set negative values as NA
    df[df < 0] <- NA

    # faster, vectorized solution
    # select numeric columns
    num_cols <- names(df)[sapply(df, is.numeric)]

    # get name of columns with 99% or more NA values
    col_to_remove <- names(df)[colMeans(is.na(df[num_cols]))>=0.9]

    # drop those columns
    return (df[setdiff(colnames(df),col_to_remove)])
}

your_df <- cleanFun(your_df)
YOLO
  • 20,181
  • 5
  • 20
  • 40
  • Thank you for your answer. I have incorporated parts of it in my update. It is still not going completely right however. – Tom Jul 20 '18 at 15:25
  • @TomKisters what's the issue? – YOLO Jul 20 '18 at 16:59
  • I'm doing something wrong with the syntax I think, please see the update in the original post. – Tom Jul 20 '18 at 17:00
  • Yes, there are problems with the way you've defined the function? What I can't understand is, what is the output you desire? – YOLO Jul 20 '18 at 17:21
  • The idea is that a new database is created in which only the columns for which at least 1% is not NA (or less than 99% is NA) are present. By doing it column by column instead of for the whole dataset I hope to not exceed memory. – Tom Jul 20 '18 at 17:28
  • I am still kind of stuck with this. Is there any more advice you can give? – Tom Jul 24 '18 at 09:43
  • @TomKisters does this solution give you memory issues? – YOLO Jul 24 '18 at 12:18
  • Yes, I am applying this to a huge dataset 341271 obs and 1415 var. – Tom Jul 24 '18 at 12:33
  • 1
    @TomKisters made a tiny change in the solution, can you try now? – YOLO Jul 24 '18 at 13:23
  • Thank you so much, I will try immediately. It will take me some time though, because I will have to right in a `try` statement to deal with the non numerical columns. The first line of your answer (`df[df < 0] <- NA`) now gives the error: `Warning message: In Ops.factor(left, right) : ‘<’ not meaningful for factors` – Tom Jul 24 '18 at 13:30
  • I believe my dataframe has variables which have factors. If I understand correctly, your solution cannot deal with those variables as they have no numerical value. I need to find a way to somehow skip these variables, without the function breaking down. – Tom Jul 24 '18 at 13:41
  • 1
    @TomKisters check now please. – YOLO Jul 24 '18 at 14:37
  • 1
    It worked, it worked, it worked. Thank you so, so much! I really appreciate all your help!. PS I tried to remove a `]` in your answer at `num_cols <- names(df)[sapply(df, is.numeric)]]` but it did not allow me without making more alterations – Tom Jul 24 '18 at 15:00
  • glad to know, sorry that `] ` I forgot to remove, I have removed now. – YOLO Jul 24 '18 at 15:13