I have a large set of data saved in a long list. This is an example of the first six records:
A <- list(c("JAMES","CHARLES","JAMES","RICHARD"),
c("JOHN","ROBERT","CHARLES"),
c("CHARLES","WILLIAM","CHARLES","MICHAEL","WILLIAM","DAVID","CHARLES","WILLIAM"),
c("CHARLES"),
c("CHARLES","CHARLES"),
c("MATTHEW","CHARLES","JACK"))
For each unique term, I would like to calculate a ratio: the sum of the term's relative frequencies within each record, divided by the number of records the term appears in.
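Written out as a formula, the quantity I am after looks roughly like this. The symbols are shorthand introduced only for this post: c_{t,r} is the number of times term t occurs in record r, n_r is the length of record r, and R_t is the set of records that contain t at least once.

```latex
\mathrm{ratio}(t) \;=\; \frac{\displaystyle\sum_{r} \frac{c_{t,r}}{n_r}}{\lvert R_t \rvert},
\qquad R_t = \{\, r : c_{t,r} > 0 \,\}
```

The numerator is the sum of relative frequencies and the denominator is the record count.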
I calculate the numerator, i.e. the sum of the relative frequency with which each unique term occurs in each record, like this:
> B <- lapply(A, function(x)table(x)/length(x))
> aggregate(unlist(B), list(names(unlist(B))), FUN=sum)
   Group.1         x
1  CHARLES 3.2916667
2    DAVID 0.1250000
3     JACK 0.3333333
4    JAMES 0.5000000
5     JOHN 0.3333333
6  MATTHEW 0.3333333
7  MICHAEL 0.1250000
8  RICHARD 0.2500000
9   ROBERT 0.3333333
10 WILLIAM 0.3750000
I'm not sure how to correctly calculate the denominator, i.e. the number of records each term appears in. I only know how to count how many times each term occurs in the whole data set:
> table(unlist(A))
CHARLES   DAVID    JACK   JAMES    JOHN MATTHEW MICHAEL RICHARD  ROBERT WILLIAM
      9       1       1       2       1       1       1       1       1       3
But some terms occur more than once within a record, and I'd like to count each term only once per record in order to get a result like this:
CHARLES   DAVID    JACK   JAMES    JOHN MATTHEW MICHAEL RICHARD  ROBERT WILLIAM
      6       1       1       1       1       1       1       1       1       1
How can this be achieved?
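One possible direction, offered as a minimal sketch rather than a settled answer: drop the repeats inside each record with unique() before counting, so every record contributes at most one occurrence per term. The name records_per_term is made up for this illustration.

```r
A <- list(c("JAMES","CHARLES","JAMES","RICHARD"),
          c("JOHN","ROBERT","CHARLES"),
          c("CHARLES","WILLIAM","CHARLES","MICHAEL","WILLIAM","DAVID","CHARLES","WILLIAM"),
          c("CHARLES"),
          c("CHARLES","CHARLES"),
          c("MATTHEW","CHARLES","JACK"))

# Deduplicate within each record, then count terms across all records.
# records_per_term is a hypothetical name used only in this sketch.
records_per_term <- table(unlist(lapply(A, unique)))
print(records_per_term)
```

On the six example records this should give 6 for CHARLES and 1 for every other term.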
Based on my example I would like to get a final output similar to this:
   Group.1         x
1  CHARLES 0.5486111
2    DAVID 0.1250000
3     JACK 0.3333333
4    JAMES 0.5000000
5     JOHN 0.3333333
6  MATTHEW 0.3333333
7  MICHAEL 0.1250000
8  RICHARD 0.2500000
9   ROBERT 0.3333333
10 WILLIAM 0.3750000
So how can I calculate the number of records each term appears in, i.e. the denominator, and the ratio itself?
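For reference, here is a sketch of how the whole ratio might be computed in one pass. It does not reuse my aggregate() call above; instead it loops over the unique terms with sapply(), which is just one of several ways to do this and may be slow on a very large list. The names terms, numerator, denominator and ratio are chosen for this illustration only.

```r
A <- list(c("JAMES","CHARLES","JAMES","RICHARD"),
          c("JOHN","ROBERT","CHARLES"),
          c("CHARLES","WILLIAM","CHARLES","MICHAEL","WILLIAM","DAVID","CHARLES","WILLIAM"),
          c("CHARLES"),
          c("CHARLES","CHARLES"),
          c("MATTHEW","CHARLES","JACK"))

terms <- sort(unique(unlist(A)))

# Numerator: for each term, sum its relative frequency over all records.
# mean(x == term) is the share of entries in record x equal to the term.
numerator <- sapply(terms, function(term) sum(sapply(A, function(x) mean(x == term))))

# Denominator: number of records that contain the term at least once.
denominator <- sapply(terms, function(term) sum(sapply(A, function(x) term %in% x)))

ratio <- numerator / denominator
print(data.frame(Group.1 = terms, x = unname(ratio)))
```

For CHARLES this works out to 3.2916667 / 6 = 0.5486111, and every other term keeps its numerator value because it appears in exactly one record.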
Thank you very much in advance for your consideration!