
Suppose I have the following data frame:

using DataFrames
df = DataFrame(A = 1:10, B = ["a","a","b","b","b","c","c","c","c","d"])
grouped_df  = groupby(df, "B")

This gives me four groups. How can I drop the groups that have fewer than, say, 2 rows? For example, how can I keep only groups a, b, and c? I can easily do it with a for loop, but I don't think that is the optimal way.

user1691278

2 Answers


If you want the result to remain grouped, then filter is the simplest option:

julia> filter(x -> nrow(x) > 1, grouped_df)
GroupedDataFrame with 3 groups based on key: B
First Group (2 rows): B = "a"
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     2  a
⋮
Last Group (4 rows): B = "c"
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     6  c
   2 │     7  c
   3 │     8  c
   4 │     9  c

If you want to get a data frame as the result in a single operation, you can do, for example:

julia> combine(grouped_df, x -> nrow(x) < 2 ? DataFrame() : x)
9×2 DataFrame
 Row │ B       A
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ a           2
   3 │ b           3
   4 │ b           4
   5 │ b           5
   6 │ c           6
   7 │ c           7
   8 │ c           8
   9 │ c           9
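
As a rough sketch of how the two approaches above relate: filter on a GroupedDataFrame also takes an ungroup keyword, so the same predicate can return either a grouped result or a plain data frame. This assumes a DataFrames.jl version recent enough that filter accepts a GroupedDataFrame; the names kept_groups and kept_rows are only illustrative.

```julia
using DataFrames

df = DataFrame(A = 1:10, B = ["a","a","b","b","b","c","c","c","c","d"])
grouped_df = groupby(df, "B")

# Keep only groups with more than one row; the result stays grouped.
kept_groups = filter(x -> nrow(x) > 1, grouped_df)

# Same predicate, but ungroup = true asks for a plain DataFrame instead.
kept_rows = filter(x -> nrow(x) > 1, grouped_df; ungroup = true)
```

The column order of kept_rows may differ from the combine result shown above, since combine places the grouping column first.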
Bogumił Kamiński
  • I wanted to keep it grouped, but when I ran your first command, I got `ERROR: MethodError: no method matching filter(::var"#15#16", ::GroupedDataFrame{DataFrame})` – user1691278 Mar 04 '21 at 23:45
  • Please make sure that you are using the latest version of DataFrames.jl, which is 0.22.5 at the time of writing. This post describes ways to check it: https://bkamins.github.io/julialang/2021/02/27/pkg_version.html and this post https://bkamins.github.io/julialang/2020/05/11/package-version-restrictions.html explains how to check what might be preventing the latest version of DataFrames.jl from being installed in your project environment. – Bogumił Kamiński Mar 05 '21 at 06:42

I think a better way is to use subset:

subset(grouped_df, :B => x -> length(x) >= 2)

If you want to keep the groups, simply set ungroup = false. You can also avoid the copying performed by combine by setting view = true, or perform the operation in place with subset!.
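
A minimal sketch of these three variants, assuming a DataFrames.jl version in which subset on a GroupedDataFrame accepts a single true or false per group. The names at_least_two, still_grouped, as_view, and df2 are hypothetical and used only for illustration.

```julia
using DataFrames

df = DataFrame(A = 1:10, B = ["a","a","b","b","b","c","c","c","c","d"])
grouped_df = groupby(df, "B")

# Group-level condition: true keeps the whole group, false drops it.
at_least_two = :B => (x -> length(x) >= 2)

# ungroup = false returns a GroupedDataFrame instead of a DataFrame.
still_grouped = subset(grouped_df, at_least_two; ungroup = false)

# view = true returns a SubDataFrame that refers to the rows of df without copying them.
as_view = subset(grouped_df, at_least_two; view = true)

# subset! drops the rows from the parent data frame itself,
# so it is run on a copy here to leave df untouched.
df2 = copy(df)
subset!(groupby(df2, "B"), at_least_two)
```

Note that subset! modifies the parent data frame of the grouped object, which is why the sketch works on a copy.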