0

I have a dataset "a" with a column "id" with about 23,000 rows, which are unique in this dataframe. I want to count the appearance frequency of these unique values in another two datasets "b" and "c". To do this, I tried the code:

count1 <- as.data.frame(apply(a,1,function(x)sum(b$id==x[45])))
a <- cbind(a,count1)

, since "id" is the 45th column in the dataframe "b". The code works for counting in b, but when I tried the same code for counting the frequency of "id" in dataframe "c":

count2 <- as.data.frame(apply(a,1,function(x)sum(c$id==x[17])))

"id" in dataframe "c" is in the 17th column. The frequencies of all "id"s are counted as 0, which is not the case it should be. Could anyone suggest where the problem is or how to fix this problem?

thelatemail
  • 91,185
  • 12
  • 128
  • 188
Ramona
  • 11
  • 1
  • 3
  • 1
    This is an excellent example of why it is a bad idea to present failed code; rather it is preferred to post a minimal example describing the goal and input and output. Usually one would expect a COUNTIF problem to be solved with sum(logical expression). – IRTFM Apr 29 '16 at 00:17

1 Answers1

2

We can actually do this in a way that might at first seem a little weird, but is relatively straight forward. Let's start by working with just data frames a and b and let's simplify things a bit. Let's assume that the id variable in both a and b are the following:

a_id <- 1:5
b_id <- 1:5

In this simple example, a_id and b_id are exactly identical. What we want to know is how many times each of the values in a_id shows up in b_id. We obviously know the answer is one time each, but how do we get R to tell us that? That's where the table function can come in handy:

table(a_id, b_id)
#     b_id
# a_id 1 2 3 4 5
#    1 1 0 0 0 0
#    2 0 1 0 0 0
#    3 0 0 1 0 0
#    4 0 0 0 1 0
#    5 0 0 0 0 1

That might look a little ugly, but you can see that we have our b_ids on the top (1-5) and our a_ids on the left-hand side. Down the diagonal, we see the counts for how many times each value of a_id shows up in b_id, and it's 1 each just like we already knew. So how do we get just that information? R has a nice function called diag that gets the main diagonal for us:

diag(table(a_id, b_id))
# 1 2 3 4 5 
# 1 1 1 1 1 

And there we have it. A vector with our "countif" values. But what if b_id doesn't have all of the values that are in a_id? If we try to do what we just did, we'll get an error because table doesn't like it when two vectors have different lengths. So we modify it a bit:

a_id <- 1:10
b_id <- 4:8
table(b_id[b_id %in% a_id])
# 4 5 6 7 8 
# 1 1 1 1 1 

A couple new things here. The use of %in% just asks R to tell us if a value exists in a vector. For example, 1 %in% 1:3 would return TRUE, but 4 %in% 1:3 would return FALSE. Next, you'll notice that we indexed b_id by using [. This only returns of the values of b_id where b_id %in% a_id is TRUE, which in this case is all of b_id.

So what does this look like if we expect more than one value of each a_id in b_id, but not every a_id value to be in b_id? Let's look at a more real example:

a_id <- 1:10
b_id <- sample(3:7, 1000, replace=TRUE)
table(b_id[b_id %in% a_id])
#   3   4   5   6   7 
# 210 182 216 177 215

Like I said, it might seem a little weird at first, but it's relatively straight forward. Hopefully this helps you more than it confuses you.

tblznbits
  • 6,602
  • 6
  • 36
  • 66