split data by "column" with an aggregated condition

Question

Consider the following data.frame:

> head(dtrain)
  content_id item_age   item_ctr likes clicks no_clicks event
1   11201926   461540 0.02787456     1     24       837     0
2   11201926   462497 0.02784223     1     24       838     0
3   11201926   473215 0.02780997     1     24       839     0
4   11201926   532983 0.02777778     1     24       840     0
5   11201926   536696 0.02774566     1     24       841     0
6   11201926   545545 0.02771363     1     24       842     0

I want to split the data by content_id, which only requires the following command:

result <- split(dtrain , f = dtrain$content_id )

But I want to keep only the rows of dtrain whose content_id has at least 1000 appearances in dtrain. In other words, rows where the same content_id is present in dtrain 1000 times or more.

In the end, I will have the data split by content_id, where each split has at least 1000 rows (that's the aggregated condition).

score 3 · Accepted Answer · edited Oct 03 '17 at 07:16

You can first filter your data frame using dplyr to retain only those content groups with 1000 or more records:

library(dplyr)
# Keep only rows whose content_id occurs at least 1000 times
temp <- dtrain %>%
    group_by(content_id) %>%
    filter(n() >= 1000) %>%
    ungroup()

and then continue as you were:

result <- split(temp, f=temp$content_id)
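If dplyr is not available, the same filter-then-split idea can be sketched in base R. This is only an illustration: the helper name `split_frequent`, its arguments, and the toy data frame are assumptions made for the example and are not part of the original question or answer.

```r
# Hypothetical helper: split `df` by column `key`, keeping only the groups
# that have at least `min_n` rows.
split_frequent <- function(df, key, min_n) {
  counts <- table(df[[key]])                # rows per key value
  keep <- names(counts)[counts >= min_n]    # key values that are frequent enough
  kept <- df[as.character(df[[key]]) %in% keep, , drop = FALSE]
  split(kept, kept[[key]], drop = TRUE)     # drop = TRUE discards empty factor levels
}

# Toy data (made up): content_id 1 has 3 rows, 2 has 1 row, 3 has 2 rows
toy <- data.frame(content_id = c(1, 1, 1, 2, 3, 3), clicks = 1:6)
result <- split_frequent(toy, "content_id", min_n = 2)
names(result)
```

For the data in the question, the call would be split_frequent(dtrain, "content_id", 1000).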

edited Oct 03 '17 at 07:16

Eran Moshe

answered Oct 03 '17 at 06:53

Tim Biegeleisen

Thanks for the elegant solution! – Eran Moshe Oct 03 '17 at 07:16
