2

I have a data frame with some missing values that I want to take from anther date frame and insert into the first. This works when the values I want to supplement miss in all rows of the first data frame.

Here is a working exaxmple:

dat <- data.frame(code = c("A11", "B22"),
                  age = c(NA, NA),
                  sex = c(NA, NA),
                  more = c(7, 4),
                  stringsAsFactors = FALSE)

age.and.sex <- read.table(textConnection("
code age sex
A11 15 m
B22 10 f
"), header = TRUE, stringsAsFactors = FALSE)

dat$sex[match(dat$code, age.and.sex$code)] <- age.and.sex$sex
dat$age[match(dat$code, age.and.sex$code)] <- age.and.sex$age

dat
  code age sex more
1  A11  15   m    7
2  B22  10   f    4

The problem is that when I the values are not missing from all rows in the firest data frame, but I need to supplement only certain rows, I get an error.

Here is an example that does not work:

dat <- data.frame(code = c("A11", "B22", "C33"),
                  age = c(NA, NA, 12),
                  sex = c(NA, NA, "m"),
                  more = c(7, 4, 9),
                  stringsAsFactors = FALSE)

age.and.sex <- read.table(textConnection("
code age sex
A11 15 m
B22 10 f
"), header = TRUE, stringsAsFactors = FALSE)

dat$sex[match(dat$code, age.and.sex$code)] <- age.and.sex$sex
Error in dat$age[match(dat$code, age.and.sex$code)] <- age.and.sex$age : 
  NAs are not allowed in subscripted assignments
dat$age[match(dat$code, age.and.sex$code)] <- age.and.sex$age
Error in dat$age[match(dat$code, age.and.sex$code)] <- age.and.sex$age : 
  NAs are not allowed in subscripted assignments

I do not understand the error that R returns.

How do I need to change my code to get this to work again?

The result I'm trying to achieve is:

dat
  code age sex more
1  A11  15   m    7
2  B22  10   f    4
3  C33  12   m    9

Thank you all for your help so far. But I must admit that I'm not quite happy with your solutions.

What you suggest is a four step approach: add one data frame to the other, move the values to their desired destination one by one, then delete the now superfluous secondary columns. As a schema, your solution looks something like this:

enter image description here

That seems awfully complicated to me!

When I look at my data, the solution that seems obvious to me has only one step: cut the "matrix" from one data frame and paste it into the "empty" area of the other data frame. Here is what it looks like in my mind:

enter image description here

And apparently that is actually possible:

dat[1:2,2:3] <- age.and.sex[1:2,2:3]
dat
  code age sex more
1  A11  15   m    7
2  B22  10   f    4
3  C33  12   m    9

This, of course, only works if both data frames are ordered in the same way. Which is why I used match(), which overcomes the problem when the rows aren't ordered – but fails when the number of rows is not the same.

Or is there a way to match(), even if the number of rows is not the same?

  • You're doing `x[match(y,z)] <- w` but `match(y,z)` has some NAs in it, so it cannot assign. – Frank Mar 17 '16 at 14:52
  • 1
    I might approach this with more of a "left join and then coalesce/ifelse columns with missing values" kind of strategy. – joran Mar 17 '16 at 14:56
  • Yup, a simple `merge(x, y, all.x=TRUE)` gets you there, save for some cleanup. – cory Mar 17 '16 at 15:09
  • @cory `merge(dat, age.and.sex, all.x = TRUE)` is the same as just `dat` (if you try it with my second example). Could you maybe post an answer with the solution you are hinting at? I am unable to recreate it. –  Mar 17 '16 at 20:19
  • 1
    @what... I typed that wrong... forgot the `by` part. Should read `merge(dat, age.and.sex, by="code", all.x=TRUE)` – cory Mar 17 '16 at 20:22
  • Ah, I see. Thanks, @cory. –  Mar 17 '16 at 20:28
  • Now reading your revised question. See @Arun's comment on my answer. The same "steps" are there under the hood, but in a way this is as good as you're going to get: as you surmise, you're not going to avoid computing a join if the data aren't already ordered. The idea of just "cutting and pasting" is probably misleading. If you really do have data in this form frequently, and you know in advance exactly the cut-and-paste, it's a different question. – Philip Mar 17 '16 at 21:07
  • In that case, one approach probably is just to use `rbind` and `cbind`, but `data.table` is so fast (thanks to @Arun's work!) that this may be a false effort to improve efficiency. (Specifically re: rbind / cbind, you can do something like `cbind(rbind(the small table, the add'l rows of the first 2 columns of the big table),the last column of the big table))` I'm guessing this wouldn't be much faster; you could try testing on real data with `microbenchmark` and reporting back. Either way it looks less robust than `data.table`'s versatile join system; versatility can be worth a lot in speed. – Philip Mar 17 '16 at 21:08
  • @Philip Computing speed is not a problem for me. What I have found difficult to understand is joins. I have avoided learning them since I began programming PHP with MySQL, and I keep avoiding them with R :-) I feel my head is just not made for joins ... But back on topic, even if I would actually be able to force myself to learn joins, I'm still wondering whether there is no solution using `match()` or `merge()` with unequal and unordered rows. –  Mar 17 '16 at 21:15
  • 1
    This seems to work: `cbind(code=dat[,1],rbind(age.and.sex[,2:3],dat[3,2:3]),more=dat[,4])` – Philip Mar 17 '16 at 21:15
  • 1
    Don't avoid learning joins!! The thing to realize is that even if you do find a solution using `match`, you're basically manually computing the join anyway, and then looking up the joined columns. So essentially you're re-inventing the wheel. Just think about the logic of your original `match` solution: "Find the row in `age.and.sex` where the `code` column matches that in `dat`, then take the corresponding other columns from `age.and.sex`." Well this is exactly how a join (or `merge`) works: lookup the columns from the joined table where the `by` column(s) match(es) in the first table. – Philip Mar 17 '16 at 21:27

1 Answers1

1

You're trying to assign to these three rows:

> match(dat$code, age.and.sex$code)
[1]  1  2 NA

because dat$code and age.and.sex$code are not the same length, so the third comparison is NA.

I'm not sure what you actually mean to be matching, but you might just try subsetting to the first two observations, or na.omit, etc.

But a better way to join data from two tables is to use a join.

library(data.table)
dat <- data.table(dat)
setkey(dat,code)
age.and.sex <- data.table(age.and.sex)
setkey(age.and.sex,code)
dat[age.and.sex]
> dat[age.and.sex]
   code age sex more i.age i.sex
1:  A11  NA   m    7    15     m
2:  B22  NA   f    4    10     f

Note how the columns of the inner table get appended to those of the outer table.

More... Per @joran's suggestion...you can use this technique to fill in missing observations:

joined <- dat[age.and.sex]
joined[is.na(age),age:=i.age] #only replace the value missing from left table
joined[,c("i.age","i.sex"):=NULL]
joined
> joined
   code age sex more
1:  A11  15   m    7
2:  B22  10   f    4

Update to address your comment...just reverse the join. There are some cleverer ways to do this less manually, but this should be simple to follow:

joined <- age.and.sex[dat]
joined[is.na(age),age:=i.age]
joined[is.na(sex),sex:=i.sex]
joined[,c("i.age","i.sex"):=NULL]
> joined
   code age sex more
1:  A11  15   m    7
2:  B22  10   f    4
3:  C33  12   m    9

If this technique is to your liking you should definitely read ?data.table and the related vignette to learn more about joins.

Philip
  • 7,253
  • 3
  • 23
  • 31
  • Incidentally I just updated the answer with a method that shows (what I consider to be) the best way to do that. Let me know if that solves your problem. – Philip Mar 17 '16 at 14:58
  • Why not directly *join+update*: `setDT(dat)[age.and.sex, c("age", "sex") := .(i.age, i.sex), on="code"][]`? – Arun Mar 17 '16 at 20:16
  • Yes--very nice, the following is not a disagreement at all, and that suggestion is better...but as stated in the answer, while there are faster, less wordy, more elegant ways to do it, OP is clearly new to `data.table`, and I think seeing things done step by step where OP can trace the output in the REPL is more pedagogically sound. We don't know if OP's MWE is exactly OP's use case or just in fact a MWE with some aspects not representative of the original problem. – Philip Mar 17 '16 at 21:00
  • On the off chance it's the latter, OP may have an easier time generalizing to the actual problem from a more spelled-out example, and either way, OP will benefit from learning how your suggestion works as well. (Probably after learning the simpler forms.) Best – Philip Mar 17 '16 at 21:00
  • @Philip You are right, I am new to `data.table` and learned quite a bit from your example. But as you can see from my most recent edit to my question, I am thinking of data differently than you, i.e. more in a visual manner – as if it was printed on paper and I could just cut data from one table and glue it into the other. This hampers me – and it makes me want to do it differently ;-) –  Mar 17 '16 at 21:06