Consider the following data.frame:

> head(dtrain)
  content_id item_age   item_ctr likes clicks no_clicks event
1   11201926   461540 0.02787456     1     24       837     0
2   11201926   462497 0.02784223     1     24       838     0
3   11201926   473215 0.02780997     1     24       839     0
4   11201926   532983 0.02777778     1     24       840     0
5   11201926   536696 0.02774566     1     24       841     0
6   11201926   545545 0.02771363     1     24       842     0

I want to split the data by content_id, which only requires the following command:

result <- split(dtrain, f = dtrain$content_id)

But then I want to keep only the rows of dtrain whose content_id has at least 1000 appearances in dtrain. In other words, rows where the same content_id is present in dtrain at least 1000 times.

In the end, I will have the data split by content_id, where each split has at least 1000 rows (because that's the aggregated condition).

Sotos
Eran Moshe

1 Answer

You can first filter your data frame using dplyr to retain only those content groups with 1000 or more records:

library(dplyr)

# The pipe must end each line; a line that starts with %>% is a syntax error in R
temp <- dtrain %>%
    group_by(content_id) %>%
    filter(n() >= 1000)

and then continue as you were:

result <- split(temp, f = temp$content_id)
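
If you would rather avoid the dplyr dependency, the same filter-then-split idea can be sketched in base R. This is only an illustration under assumptions: the toy `dtrain` below and the `min_rows` and `group_size` names are made up for the example, and the threshold is lowered to 3 so the result is easy to inspect (use 1000 on the real data).

```r
# Toy data standing in for dtrain (hypothetical values, not the real data)
dtrain <- data.frame(
  content_id = c(rep(1, 4), rep(2, 2), rep(3, 3)),
  clicks     = 1:9
)
min_rows <- 3  # assumption for the demo; the question uses 1000

# ave() returns, for every row, the number of rows sharing its content_id
group_size <- ave(seq_len(nrow(dtrain)), dtrain$content_id, FUN = length)

# Keep only rows whose content_id group is large enough
temp <- dtrain[group_size >= min_rows, ]

# drop = TRUE discards empty pieces if content_id is a factor with unused levels
result <- split(temp, f = temp$content_id, drop = TRUE)
```

Here content_id 2 has only two rows, so it is filtered out before the split. The `drop = TRUE` argument only matters when content_id is a factor; with a numeric column, split() builds the grouping from the values that are still present.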
Tim Biegeleisen