I have this dataset:

text               sentiment
randomstring        positive
randomstring        negative
randomstring        neutral
randomstring        mixed

Then if I run a countmap I have:

"mixed"    -> 600
"positive" -> 2000
"negative" -> 3300
"netrual"  -> 780

I want to randomly sample from this dataset so that I keep all records of the smallest class (mixed = 600) and the same number of records from each of the other classes (positive = 600, negative = 600, neutral = 600).

I know how to do this in pandas:

min_count = int(data['sentiment'].value_counts().min())
df_teste = [data.loc[data.sentiment == i].sample(n=min_count, random_state=SEED)
            for i in data.sentiment.unique()]

df_teste = pd.concat(df_teste, axis=0, ignore_index=True)

But I am having a hard time doing this in Julia.

Note: I don't want to hardcode which class is the smallest one, so I am looking for a solution that infers that from the countmap or freqtable, if possible.
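
For example, I suppose something like this (assuming StatsBase.jl and its countmap) would give me the size of the smallest class without hardcoding it; it is the sampling step that I am missing:

using StatsBase

cm = countmap(data.sentiment)    # e.g. Dict("mixed" => 600, "positive" => 2000, ...)
min_count = minimum(values(cm))  # 600, the size of the smallest class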

math_guy_shy

1 Answer

Why do you want a countmap or freqtable solution if you seem to want to use a data frame in the end?

This is how you would do it with DataFrames.jl (without StatsBase.jl or FreqTables.jl, as they are not needed for this):

julia> using Random

julia> using DataFrames

julia> df = DataFrame(text = [randstring() for i in 1:6680],
                      sentiment = shuffle!([fill("mixed", 600);
                                            fill("positive", 2000);
                                            fill("negative", 3300);
                                            fill("neutral", 780)]))
6680×2 DataFrame
  Row │ text      sentiment
      │ String    String
──────┼─────────────────────
    1 │ R3W1KL5b  positive
    2 │ uCCpNrat  negative
    3 │ fwqYTCWG  negative
  ⋮   │    ⋮          ⋮
 6678 │ UJiNrlcw  negative
 6679 │ 7aiNOQ1o  neutral
 6680 │ mbIOIQmQ  negative
           6674 rows omitted

julia> gdf = groupby(df, :sentiment);

julia> min_len = minimum(nrow, gdf)
600

julia> df_sampled = combine(gdf) do sdf
           return sdf[randperm(nrow(sdf))[1:min_len], :]
       end
2400×2 DataFrame
  Row │ sentiment  text
      │ String     String
──────┼─────────────────────
    1 │ positive   O0QsyrJZ
    2 │ positive   7Vt70PSh
    3 │ positive   ebFd8m4o
  ⋮   │     ⋮         ⋮
 2398 │ neutral    Kq8Wi2Vv
 2399 │ neutral    yygOzKuC
 2400 │ neutral    NemZu7R3
           2394 rows omitted

julia> combine(groupby(df_sampled, :sentiment), nrow)
4×2 DataFrame
 Row │ sentiment  nrow
     │ String     Int64
─────┼──────────────────
   1 │ positive     600
   2 │ negative     600
   3 │ mixed        600
   4 │ neutral      600

If your data is very large and you need the operation to be very fast, there are more efficient ways to do it, but in most situations this should be fast enough, and the solution does not require any extra packages.
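
For reference, one possible way to drive the same selection directly from a countmap, as the question mentions. This is only a sketch and, unlike the solution above, it needs StatsBase.jl; names like idx and df_balanced are just illustrative:

julia> using StatsBase

julia> cm = countmap(df.sentiment);   # class => count, as in the question

julia> min_len = minimum(values(cm))
600

julia> idx = reduce(vcat,
                    [sample(findall(==(s), df.sentiment), min_len, replace=false)
                     for s in keys(cm)]);

julia> df_balanced = df[shuffle!(idx), :];

Here sample(...; replace=false) draws min_len distinct row indices per class, so the result is the same kind of balanced subset, just with the rows shuffled rather than grouped.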

Bogumił Kamiński