3

Assuming I have a data frame that looks like this:

    category  type
[1] A        green
[2] A        purple
[3] A        orange
[4] B        yellow
[5] B        green
[6] B        orange
[7] C        green

How do I get a list containing those types that appear in each category? In this case it should look like:

    type
[1] green

I know this question is basic and has probably been asked before, but my method is too long and I'm sure there's a more efficient way of doing it: I split the data frame by category and then take the set intersection. Is there a better way? Thanks!

Jaap
Tavi
  • Do you know the number of categories, and is there any possibility that a type may appear more than once in a category? From your example above you know that green appears in each category because it appears 3 times, but I'm not sure if this will hold true for your actual data. – Joe Jan 28 '15 at 21:16
  • @Joe I'm guaranteed that a type will not appear more than once in any given category – Tavi Jan 28 '15 at 21:17

6 Answers

4

Assuming a type appears in a category at most once (otherwise change the == to >=) and using table you could try the following:

 colnames(table(df))[colSums(table(df)) == length(unique(df$category))]
[1] "green"
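
If a type can repeat within a category, comparing raw counts with >= may let through a type that is merely frequent in a few categories. A minimal sketch of one way around that, assuming it is enough to test presence (count > 0) per category; the sample data with a duplicated row is hypothetical:

```r
# Hypothetical data: "orange" appears twice in category A
df <- data.frame(
  category = c("A", "A", "A", "A", "B", "B", "B", "C"),
  type     = c("green", "purple", "orange", "orange",
               "yellow", "green", "orange", "green"),
  stringsAsFactors = FALSE
)

# Rows are categories, columns are types
tab <- table(df$category, df$type)

# Count the categories in which each type is present at least once
n_categories_with_type <- colSums(tab > 0)

# Keep the types present in every category
common <- colnames(tab)[n_categories_with_type == nrow(tab)]
common
# [1] "green"
```

Here the raw count for "orange" is 3, which would pass a `>= 3` check even though "orange" never appears in category C.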
DatamineR
3

Here's one approach using data.table - provided that type only appears at most once per category:

library(data.table)
DT <- data.table(DF)

DT[, list(nCat = .N), by = type][
  nCat == length(unique(DT$category)),
  type]
# [1] "green"

All this does is aggregate the original data as a count of rows by type (nCat), and then subset that result by taking the rows where nCat is equal to the number of unique categories in DT.

Edit: Thanks to @Arun, this can be done more concisely with a newer version of data.table by taking advantage of the uniqueN function:

unique(DT)[, .N, by=type][N == uniqueN(DT$category), type]

If you aren't guaranteed that type will appear at most once per category, you can make a slight modification to the above:

DT[, list(nCat = length(unique(category))), by = type][
  nCat == length(unique(DT$category)),
  type]
# [1] "green"

Data:

DF <- read.table(
  text="category  type
A        green
A        purple
A        orange
B        yellow
B        green
B        orange
C        green",
  header=TRUE,
  stringsAsFactors=FALSE)
nrussell
  • Care to explain the code for those who don't know `data.table` syntax? – nico Jan 28 '15 at 21:22
  • @nico Yes, sorry I was expanding on my answer as you commented. – nrussell Jan 28 '15 at 21:27
  • Nice! Just another way: `unique(dt)[, .N, by=type][N == uniqueN(dt$category), type]`. `uniqueN` is new, in 1.9.5, which is a faster version of `length(unique(.))`. – Arun Jan 28 '15 at 21:39
  • @Arun Thank you! The machine I'm on at the moment has an older version of R / `data.table`, but I will definitely make use of that on my other computer. – nrussell Jan 28 '15 at 21:47
2

I couldn't really find a super-obvious solution, however this does the job.

df <- data.frame(category=c("A", "A", "A", "B", "B", "B", "C"), 
                 type=c("green", "purple", "orange", "yellow", 
                        "green", "orange", "green"))

# Split the data frame by type
# This gives a list with elements corresponding to each type
types <- split(df, df$type)

# Find the length of each element of the list
len <- sapply(types, function(t){length(t$type)})

# If the length is equal to the number of categories then 
# the type is present in all categories 
res <- names(which(len==length(unique(df$category))))

Note that sapply will return the types as names of the vector, hence the call to names in the next statement.

nico
2

If df is your data.frame, here is 'one' line of code thanks to Reduce:

x = df$category
y = df$type

Reduce(intersect, lapply(unique(x), function(u) y[x==u]))
#[1] "green"
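
The same idea can be sketched with split, which builds the per-category vectors directly. This is only an equivalent formulation under the assumption that type and category are plain character columns; the name by_category is introduced here for illustration:

```r
df <- data.frame(
  category = c("A", "A", "A", "B", "B", "B", "C"),
  type     = c("green", "purple", "orange", "yellow",
               "green", "orange", "green"),
  stringsAsFactors = FALSE
)

# One character vector of types per category
by_category <- split(df$type, df$category)

# Fold intersect() over the list: ((A ∩ B) ∩ C)
common <- Reduce(intersect, by_category)
common
# [1] "green"
```

Because intersect discards duplicates, this should also hold up if a type repeats within a category.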
Colonel Beauvel
2

One way would be to make a table and either select the types whose total count equals the number of categories (3 in this case) or, since you say a type can appear only once per category, take the mean count per type and select the types where the mean == 1 (or >= 1).

dat <- read.table(header = TRUE, text="category  type
A        green
A        purple
A        orange
B        yellow
B        green
B        orange
C        green")

tbl <- data.frame(with(dat, ftable(category, type)))
tbl[with(tbl, ave(Freq, type)) >= 1, ]

#   category  type Freq
# 1        A green    1
# 2        B green    1
# 3        C green    1

unique(tbl[with(tbl, ave(Freq, type)) >= 1, 'type'])
# [1] green
rawr
1

Assuming your data are in df:

# aggregate() names the grouping column Group.1 and the count column x
df.sum <- aggregate(df$type, by = list(df$type), FUN = length)
types <- df.sum[df.sum$x == length(unique(df$category)), ]

This counts how many times each type appears and keeps the types that appear as many times as there are categories. If a type never appears more than once in a category, this does what you want; it will not work if that assumption is violated.
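
If the at-most-once assumption might not hold, one possible variation is to count distinct categories per type instead of rows. A minimal sketch using the formula interface of aggregate; the duplicated row in the sample data and the names counts and common are hypothetical:

```r
# Hypothetical data: "orange" appears twice in category A
df <- data.frame(
  category = c("A", "A", "A", "A", "B", "B", "B", "C"),
  type     = c("green", "purple", "orange", "orange",
               "yellow", "green", "orange", "green"),
  stringsAsFactors = FALSE
)

# For each type, count the distinct categories it appears in;
# the result has columns `type` and `category` (the count)
counts <- aggregate(category ~ type, data = df,
                    FUN = function(x) length(unique(x)))

# Keep the types seen in every category
n_categories <- length(unique(df$category))
common <- as.character(counts$type[counts$category == n_categories])
common
# [1] "green"
```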

Joe
  • Hi Joe, which package provides the aggregate function? Thanks! – Tavi Jan 28 '15 at 21:33
  • @maryam it's loaded by default, it comes from the `stats` package. How do I know? In R I typed `?aggregate`. If it did not come out with any result I could have tried `??aggregate` and if that did not work either `RSiteSearch("aggregate")`! – nico Jan 28 '15 at 21:37