0

I have email data that I'd like to compare. To gather the data I had to pull sends, deliveries, opens, etc. from separate tables, so I essentially have 5 data frames with identical information, but the send table has 100% of user ID's that were mailed. The delivery, open, etc. tables contain the exact same variables/columns, but just less and less of the rows since not everyone that was sent an email opened it, clicked it, etc.

What I'd like to do is combine all of this into the sent data frame, creating a new column that has a Y/N for each user saying whether that user received the email, opened it, clicked, by comparing whether the USER ID exists in the subsequent tables. So sort of a semi-join, but I just want to create a new column in the 1st table that says whether each unique USER ID exists in the other tables or not. Simplified example, I have the first column from each of the two tables below.

Sent       USER ID  1  3  17  26  35  124
           Deliv?   Y  N  Y   N   Y   Y
Delivered  USER ID  1  17  35  124

Tried using mutate with match, then with ifelse, but no dice so far.

Thanks!

1 Answers1

0

Try the following.

x <- scan(text = "1  3  17  26  35  124")
y <- scan(text = "1  17  35  124")
z <- factor(x %in% y, labels = c("N", "Y"))
z
#[1] Y N Y N Y Y
#Levels: N Y

Of course, the names of your variables will be different. But the method is this one.

EDIT.
Here is another way, if you don't want a result of class factor.

z2 <- c("N", "Y")[(x %in% y) + 1L]
z2
#[1] "Y" "N" "Y" "N" "Y" "Y"

EDIT 2.
Note that instead of TRUE/FALSE it's possible to convert the logical values into binary integers in a number of ways. The more readable one would be to use as.integer.

z <- factor(as.integer(x %in% y), labels = c("N", "Y"))

[Two other ways are hacks, to add zero or to multiply by one: (x %in% y) + 0L or (x %in% y)*1L.]

Then you would create a new column in data.frame sent with the result of this.

sent$Deliv <- z    # or z2

Or even more simple, don't create the intermediate variable z (or z2) and assign factor(...) directly.

sent$Deliv <- factor(as.integer(x %in% y), labels = c("N", "Y"))
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Thanks for the suggestion. The only challenge here, which I should have mentioned, is that I'm working with 800K records, so it's not possible to type in the values one by one. Is there a method to do this when I can just name the column? Or at least pull them into vectors and work them that way? – Gopher_Gold25 Nov 06 '17 at 16:53
  • Is the labeling `N/Y` correct? If so you can always name the columns, the variables `x` and `y` above would be considered the names of those columns. You could do `Sent$USER_ID %in% Delivered$USER_ID`. – Rui Barradas Nov 06 '17 at 17:33
  • Rui, Wwhen I said 'name the column' I meant reference the column. My ideal solution would simply append one new column to the 'sent' file, and if the User ID (row) in 'sent' had a match in 'delivered' then a Y would be input in the new column for that User ID. If the User ID from 'sent did not have a match in the delivered file, then that row would get an N value. Does that make more sense? It looks like after using your %in% function, I'm getting TRUE/FALSE, which will work instead of Y/N. So it appears your solution works, but just want to confirm it's doing what I think it is. – Gopher_Gold25 Nov 10 '17 at 23:13
  • @Gopher_Gold25 If you're getting `TRUE/FALSE` then it's working the way it should. You can transform it to integer binary values if you want, see my edit. But in order to properly label those values you will need to convert to factor, with labels `Y/N`. – Rui Barradas Nov 11 '17 at 11:25