I have a large set of data saved in a long list. This is an example of the first six records:

A <- list(c("JAMES","CHARLES","JAMES","RICHARD"),  
c("JOHN","ROBERT","CHARLES"),  
c("CHARLES","WILLIAM","CHARLES","MICHAEL","WILLIAM","DAVID","CHARLES","WILLIAM"),  
c("CHARLES"),  
c("CHARLES","CHARLES"),  
c("MATTHEW","CHARLES","JACK"))  

For each unique term, I would like to calculate a ratio: the sum of the term's relative frequencies within each record, divided by the number of records the term appears in.

I calculate the numerator, i.e. the sum of the relative frequency with which each unique term occurs in each record, like this:

> B <- lapply(A, function(x)table(x)/length(x))  
> aggregate(unlist(B), list(names(unlist(B))), FUN=sum)  
   Group.1         x
1  CHARLES 3.2916667  
2    DAVID 0.1250000  
3     JACK 0.3333333  
4    JAMES 0.5000000  
5     JOHN 0.3333333  
6  MATTHEW 0.3333333  
7  MICHAEL 0.1250000  
8  RICHARD 0.2500000  
9   ROBERT 0.3333333  
10 WILLIAM 0.3750000  

However, I'm not sure how to correctly calculate the denominator, i.e. the number of records each term appears in. I only know how to count how many times each term occurs in the whole data set:

> table(unlist(A))  

CHARLES   DAVID    JACK   JAMES    JOHN MATTHEW MICHAEL RICHARD  ROBERT WILLIAM
      9       1       1       2       1       1       1       1       1       3

But some terms occur more than once within a record, and I'd like to ignore these repetitions so that each term is counted at most once per record, giving a result like this:

CHARLES   DAVID    JACK   JAMES    JOHN MATTHEW MICHAEL RICHARD  ROBERT WILLIAM
      6       1       1       1       1       1       1       1       1       1

How can this be achieved?
Based on my example I would like to get a final output similar to this:

   Group.1         x
1  CHARLES 0.5486111  
2    DAVID 0.1250000  
3     JACK 0.3333333  
4    JAMES 0.5000000  
5     JOHN 0.3333333  
6  MATTHEW 0.3333333  
7  MICHAEL 0.1250000  
8  RICHARD 0.2500000  
9   ROBERT 0.3333333  
10 WILLIAM 0.3750000  

So how can I calculate the number of records each term appears in, i.e. the denominator, and the ratio itself?

Thank you very much in advance for your consideration!

user0815

2 Answers

1

When aggregating, instead of sum, just use mean:

aggregate(unlist(B), list(names(unlist(B))), FUN=mean)  
#    Group.1         x
# 1  CHARLES 0.5486111
# 2    DAVID 0.1250000
# 3     JACK 0.3333333
# 4    JAMES 0.5000000
# 5     JOHN 0.3333333
# 6  MATTHEW 0.3333333
# 7  MICHAEL 0.1250000
# 8  RICHARD 0.2500000
# 9   ROBERT 0.3333333
# 10 WILLIAM 0.3750000
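
Why the mean works: each record contributes exactly one entry per distinct term to `unlist(B)`, so the number of entries sharing a name is the number of records that term appears in, and sum divided by that count is the mean. A minimal sketch checking this on the example data from the question (the names `by_sum`, `by_count` and `by_mean` are only illustrative):

```r
A <- list(c("JAMES","CHARLES","JAMES","RICHARD"),
          c("JOHN","ROBERT","CHARLES"),
          c("CHARLES","WILLIAM","CHARLES","MICHAEL","WILLIAM","DAVID","CHARLES","WILLIAM"),
          c("CHARLES"),
          c("CHARLES","CHARLES"),
          c("MATTHEW","CHARLES","JACK"))

# Relative frequency of each term within each record, flattened to a named vector
B <- unlist(lapply(A, function(x) table(x) / length(x)))

# One entry per (record, distinct term) pair, so the number of entries per
# name equals the number of records containing that term
by_sum   <- tapply(B, names(B), sum)     # numerator
by_count <- tapply(B, names(B), length)  # denominator
by_mean  <- tapply(B, names(B), mean)

# The explicit ratio and the mean should agree
all.equal(by_sum / by_count, by_mean)
```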
flodel
0
B <- lapply(A, unique)
B
table(unlist(B))

CHARLES   DAVID    JACK   JAMES    JOHN MATTHEW MICHAEL RICHARD  ROBERT WILLIAM 
      6       1       1       1       1       1       1       1       1       1 

From the earlier post (which you really should have cited, user0815): stick the `unique` inside that `table` call.

 BL <- lapply(A, function(x)table(unique(x))/length(x))
 ## turn list into a vector
 B <- unlist(BL)

 aggregate(B, list(names(B)), FUN=sum)
#------------
   Group.1         x
1  CHARLES 2.5416667
2    DAVID 0.1250000
3     JACK 0.3333333
4    JAMES 0.2500000
5     JOHN 0.3333333
6  MATTHEW 0.3333333
7  MICHAEL 0.1250000
8  RICHARD 0.2500000
9   ROBERT 0.3333333
10 WILLIAM 0.1250000
IRTFM
  • Then, (hopefully not stating the obvious to the OP) assuming that the output of `aggregate` was called `out`, `out$rel <- out$x/table(unlist(B))` – A5C1D2H2I1M1N2O1R2T1 Sep 18 '12 at 16:11
  • Yes I was "hopeful". I got an error running his aggregate() call on either his A or my B. – IRTFM Sep 18 '12 at 16:13
  • This question is an exact copy of [this one](http://stackoverflow.com/questions/11546941/calculate-relative-frequency-of-list-terms-and-its-sum-in-r) with only one addition (`unique`), but the OP forgot to include that in their question. – A5C1D2H2I1M1N2O1R2T1 Sep 18 '12 at 16:15
  • Thank you for your comments; you are absolutely right about _unique_, of course. I had also accidentally deleted one line while posting, which I have now edited back in. The real question is about a way to calculate the ratio, though. I only used the template of another question to illustrate my problem more clearly. – user0815 Sep 18 '12 at 16:20
  • @DWin First of all, I'm sorry for failing to cite the other question. Thank you very much for letting me know and for your edit! Unfortunately, your solution produces a different result, though: _3.2916667 / 6 = 0.5486111_, not _2.5416667_. :-( – user0815 Sep 18 '12 at 16:32
  • @user0815, the answer is in my original comment. That comment starts where [sgibb's answer](http://stackoverflow.com/a/11547032/1270695) stops. – A5C1D2H2I1M1N2O1R2T1 Sep 18 '12 at 16:46
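
The approach described in the comments can be sketched as follows: take the numerator from the original `aggregate` call (repetitions counted), take the denominator from `table` applied to the per-record `unique` terms, and divide. This is only a sketch on the example data; the names `rel`, `n_records` and `out` are illustrative, and the division is matched by term name rather than by position as a precaution in case the two orderings differ.

```r
A <- list(c("JAMES","CHARLES","JAMES","RICHARD"),
          c("JOHN","ROBERT","CHARLES"),
          c("CHARLES","WILLIAM","CHARLES","MICHAEL","WILLIAM","DAVID","CHARLES","WILLIAM"),
          c("CHARLES"),
          c("CHARLES","CHARLES"),
          c("MATTHEW","CHARLES","JACK"))

# Numerator: within-record relative frequencies, summed per term
rel <- unlist(lapply(A, function(x) table(x) / length(x)))
out <- aggregate(rel, list(names(rel)), FUN = sum)

# Denominator: number of records each term appears in (repetitions dropped)
n_records <- table(unlist(lapply(A, unique)))

# Divide, looking up each term's record count by name
out$rel <- out$x / as.vector(n_records[as.character(out$Group.1)])
out
```

For this data the `rel` column should match the result of aggregating with `mean`, for example 3.2916667 / 6 for CHARLES.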