I have a data frame like this:

scores <-structure(list(student = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L), .Label = c("adam", "mike", "rose"), class = "factor"), 
    year = c(2001L, 2002L, 2003L, 2001L, 2002L, 2003L, 2001L, 
    2002L, 2003L), math = c(5L, 3L, 5L, 3L, 2L, 4L, 4L, 2L, NA
    ), english = c(2L, NA, 5L, 4L, NA, 3L, 4L, NA, 4L), history = c(NA, 
    4L, 5L, NA, 3L, 4L, NA, 5L, 3L), geography = c(4L, 5L, 5L, 
    5L, 4L, 4L, 3L, 5L, 3L)), class = "data.frame", row.names = c(NA, 
-9L))

I want to delete every variable for which no student has a score in a given year. For example, no student has a score for English in 2002, so the variable "english" should be deleted if my relevant year is 2002. Similarly, no student has a score for History in 2001, so if my relevant year is 2001, the variable "history" should be deleted. If my relevant year is 2003, no variable is deleted, because every variable has at least one score in that year; for instance, "math" is missing for Rose, but Adam and Mike both have a score.

To do this, I wrote the following function, which does the job:

byearNA <- function(x, z = 3, ano = 2001) {
    # Always keep the identifier columns (the first z - 1 columns)
    matri <- x[seq_len(z - 1)]
    for (i in z:ncol(x)) {
        # Keep column i if at least one row of year `ano` has a value
        if (!all(is.na(x[x[[2]] == ano, i]))) {
            matri <- cbind(matri, x[i])
        }
    }
    return(matri)
}

However, I believe this can be done with functions that already exist in R. I tried for a long time but could not find a way, which is why I wrote my own function.

How can I achieve this task with native functions in R?

Thank you very much in advance.

cs95
    I think you might be better off storing this data in a long format, e.g.: `scolong <- reshape(scores, idvar=c("student","year"), varying=list(-(1:2)), times=names(scores)[-(1:2)], direction="long")` Then it is easy to just drop whole measurements using the below mentioned `na.omit`, without affecting the other valid measurements - `na.omit(scolong[scolong$year == 2002,])` – thelatemail Nov 02 '20 at 03:14

1 Answer

I'm not 100% sure what you are looking for, but have you tried this?

scores2 <- na.omit(scores)

This will return the 2 rows that are complete cases (no NA values).
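As a quick check of what `na.omit` does here, the following sketch rebuilds the `scores` data frame from the question (written with `data.frame()` instead of `structure()`, which is an assumption of convenience) and inspects the result. It suggests that `na.omit` filters rows and leaves every column in place, so it does not by itself drop a variable for a given year.

```r
# Rebuild the example data from the question
scores <- data.frame(
  student   = factor(rep(c("adam", "mike", "rose"), each = 3)),
  year      = rep(2001:2003, times = 3),
  math      = c(5L, 3L, 5L, 3L, 2L, 4L, 4L, 2L, NA),
  english   = c(2L, NA, 5L, 4L, NA, 3L, 4L, NA, 4L),
  history   = c(NA, 4L, 5L, NA, 3L, 4L, NA, 5L, 3L),
  geography = c(4L, 5L, 5L, 5L, 4L, 4L, 3L, 5L, 3L)
)

# na.omit() removes every row that contains at least one NA
scores2 <- na.omit(scores)

nrow(scores2)                 # number of complete rows
as.character(scores2$student) # students with a complete row
names(scores2)                # all six columns are still present
```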

Adding some lines after thelatemail's comment: storing the data in long format is a good idea. You will want to work with a long data frame if you don't want to see NA values in your table. Here is a dplyr and tidyr method:

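library(dplyr)
library(tidyr)

# Reshape to long format: one row per student, year and subject
scores_gathered <- gather(scores, "class", "count", 3:6)

# Count the non-missing scores for each year and subject
scores_gathered <- scores_gathered %>%
  group_by(year, class) %>%
  summarize(n_scores = sum(!is.na(count)), .groups = "drop")

# Keep only the year/subject pairs with at least one score
complete_list <- scores_gathered %>%
  filter(n_scores > 0) %>%
  select(year, class) %>%
  mutate(has_students = "yes")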
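
For comparison, the wide data frame can also be handled with base R only. This is a minimal sketch, not a definitive solution: the function name `drop_empty_for_year` and its arguments are hypothetical, and it assumes the identifier columns are called `student` and `year` as in the question. It counts the non-missing values of each column within the chosen year using `colSums(!is.na(...))` and keeps the columns with at least one value.

```r
# Example data from the question, rebuilt with data.frame()
scores <- data.frame(
  student   = factor(rep(c("adam", "mike", "rose"), each = 3)),
  year      = rep(2001:2003, times = 3),
  math      = c(5L, 3L, 5L, 3L, 2L, 4L, 4L, 2L, NA),
  english   = c(2L, NA, 5L, 4L, NA, 3L, 4L, NA, 4L),
  history   = c(NA, 4L, 5L, NA, 3L, 4L, NA, 5L, 3L),
  geography = c(4L, 5L, 5L, 5L, 4L, 4L, 3L, 5L, 3L)
)

# Hypothetical helper: drop the columns that are entirely NA in year `ano`
drop_empty_for_year <- function(x, ano, id_cols = c("student", "year")) {
  in_year <- x$year == ano
  # Number of non-missing values per column, restricted to the chosen year
  n_present <- colSums(!is.na(x[in_year, , drop = FALSE]))
  # Identifier columns are always kept
  keep <- names(x) %in% id_cols | n_present > 0
  x[, keep, drop = FALSE]
}

names(drop_empty_for_year(scores, 2001))
names(drop_empty_for_year(scores, 2002))
names(drop_empty_for_year(scores, 2003))
```

Like `byearNA` in the question, this returns all rows and only removes columns; subsetting to the chosen year afterwards would be a separate step.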
hachiko
  • Dear Josef, thank you very much for your quick response. However, I am not looking for complete cases. In panel data, we have observations on individuals for whom there is data for several years. So, as in my example, each individual (Adam, Mike and Rose) may have data for several years (2001, 2002 and 2003). I want to delete all variables for which no individual has data in a given year. For example, suppose my year of interest is 2002. If the variable weight wasn't measured in 2002 for any individual (nobody was weighed in that year), I want to delete the variable weight. – Reynaldo Senra Nov 02 '20 at 03:01