
Okay, so I have this data.frame:

        A      B      C
1  yellow purple   <NA>
2    <NA>   <NA> yellow
3  orange yellow   <NA>
4  orange   <NA>  brown
5    <NA>  brown purple
6  yellow purple   pink
7  purple  green   pink
8  yellow   pink  green
9  purple orange   <NA>
10 purple   <NA>  brown

I want to fill the missing values in the leftmost columns with the values from the columns to their right, so that each row's non-missing values shift left and the NAs end up at the end. Rows 2, 4, 5 and 10 show the effect:

        A      B      C
1  yellow purple   <NA>
2  yellow   <NA>   <NA>
3  orange yellow   <NA>
4  orange  brown   <NA>
5   brown purple   <NA>
6  yellow purple   pink
7  purple  green   pink
8  yellow   pink  green
9  purple orange   <NA>
10 purple  brown   <NA>

My idea was to loop over the columns, find the rows with missing values, and replace them with the values from the column to the right. That approach is potentially flawed, though: what if there were 4 columns and the values in columns 2 and 3 were both NA? Does anyone have an idea for an algorithm that would work?

JellisHeRo

1 Answer


We can loop over the rows, concatenate each row's non-NA elements followed by its NA elements, and assign the result back to the dataset:

df[] <-  t(apply(df, 1, function(x) c(x[!is.na(x)], x[is.na(x)])))
df
#        A      B     C
#1  yellow purple  <NA>
#2  yellow   <NA>  <NA>
#3  orange yellow  <NA>
#4  orange  brown  <NA>
#5   brown purple  <NA>
#6  yellow purple  pink
#7  purple  green  pink
#8  yellow   pink green
#9  purple orange  <NA>
#10 purple  brown  <NA>

data

df <- structure(list(A = c("yellow", NA, "orange", "orange", NA, "yellow", 
"purple", "yellow", "purple", "purple"), B = c("purple", NA, 
"yellow", NA, "brown", "purple", "green", "pink", "orange", NA
 ), C = c(NA, "yellow", NA, "brown", "purple", "pink", "pink", 
 "green", NA, "brown")), .Names = c("A", "B", "C"), row.names = c("1", 
 "2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")
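
The question also raises the case of 4 columns with NAs in columns 2 and 3. Here is a minimal sketch of the same row-wise approach on that case. The data frame `df4` and the helper name `shift_left` are made up for illustration and are not part of the original post.

```r
# Hypothetical 4-column data frame (not from the question);
# rows 1 and 2 have NAs in the middle columns.
df4 <- data.frame(
  A = c("red", NA, "blue"),
  B = c(NA, NA, "green"),
  C = c(NA, "pink", NA),
  D = c("gold", "grey", "teal"),
  stringsAsFactors = FALSE
)

# Illustrative helper: keep the non-NA values of one row in order,
# then append that row's NAs at the end.
shift_left <- function(x) c(x[!is.na(x)], x[is.na(x)])

# apply() passes each row as a character vector; t() turns the result
# back into one row per original row, and df4[] keeps the data.frame.
df4[] <- t(apply(df4, 1, shift_left))

df4
# Row 1 becomes red, gold, NA, NA
# Row 2 becomes pink, grey, NA, NA
# Row 3 becomes blue, green, teal, NA
```

Because every row is handled independently and all of its non-NA values are moved at once, the number of consecutive NAs in a row does not matter.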
akrun
  • Gets the job done. Thanks! Can you tell me a little more about the algorithm and what exactly the function inside `apply` is taking in? Is it each row of the data frame? Also, I've never seen `[]` used next to the data frame object name like you did there. What does that do? – JellisHeRo Mar 03 '18 at 02:10
  • @JellisHeRo It is taking in each row of the dataset when we specify `MARGIN = 1` in `apply`. Then, we subset the elements which are non-NA, `x[!is.na(x)]`, followed by the elements that are NA, `x[is.na(x)]`, and concatenate them with `c`. The assignment with `[]` ensures that the original data structure is restored. – akrun Mar 03 '18 at 02:13
  • Alright, so if I understand this correctly, it essentially repurposes the values that were already in each row. I don't take that the `[]` trick was a new R feature. Maybe it is but it wouldn't surprise me. – JellisHeRo Mar 03 '18 at 02:17
  • 1
    I want to point out that even though my question was apparently asked before, I think your answer is the best of all the answers considered. – JellisHeRo Mar 06 '18 at 15:34
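
On the `[]` question raised in the comments, the following sketch contrasts assigning with and without `[]`. The objects `df_small`, `shifted`, `keep` and `lose` are hypothetical names used only for this illustration.

```r
# Small hypothetical data frame for illustration
df_small <- data.frame(A = c(NA, "x"), B = c("y", NA),
                       stringsAsFactors = FALSE)

# apply() converts the data frame to a matrix, so the result is a matrix
shifted <- t(apply(df_small, 1, function(x) c(x[!is.na(x)], x[is.na(x)])))
is.matrix(shifted)     # TRUE

# With []: the values are written into the existing columns,
# so the object stays a data.frame with its original column names.
keep <- df_small
keep[] <- shifted
is.data.frame(keep)    # TRUE

# Without []: the name is simply rebound to the matrix.
lose <- df_small
lose <- shifted
is.data.frame(lose)    # FALSE
```

So `df[] <- value` replaces the contents of `df` while keeping its class and column names, whereas `df <- value` replaces the whole object.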