How do I do a train and test split in a polars dataframe?

Question

I am trying to find a simple way of randomly splitting a polars dataframe into train and test sets. This is how I am doing it right now:

train, test = (
    df.with_columns(pl.lit(np.random.rand(df.height) > 0.8).alias('split'))
    .partition_by('split')
)

However, this leaves an extra `split` column in both dataframes that I need to drop afterwards.

There is an open request for `.partition_by` to drop the columns: https://github.com/pola-rs/polars/issues/8808 - Could you shuffle the dataframe: `df.sample(fraction=1, shuffle=True)` and take `.head()` + `.tail()` or do you need random sized splits? — jqurious, Jun 09 '23 at 21:32
Would you mind cleaning up your comment and adding it as an answer so I can mark it as accepted? — ste_kwr, Jun 10 '23 at 15:14
Yep sure thing, I wasn't sure if it answered the question properly or not. — jqurious, Jun 10 '23 at 15:28
https://stackoverflow.com/a/76546689/ may be a more suitable answer? — jqurious, Jun 25 '23 at 00:28

score 1 · Accepted Answer · answered Jun 10 '23 at 15:25

There is an open feature request (https://github.com/pola-rs/polars/issues/8808) for allowing `.partition_by` to drop keys.

As discussed in the comments, it is possible to shuffle a dataframe using `.sample()`:

import polars as pl

df = pl.DataFrame({"val": range(100)})

# fraction=1 keeps every row; shuffle=True randomizes the row order
df = df.sample(fraction=1, shuffle=True)

shape: (100, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 64  │
│ 40  │
│ 39  │
│ 98  │
│ …   │
│ 21  │
│ 29  │
│ 87  │
│ 99  │
└─────┘

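The shuffled dataframe can then be split into parts, e.g. using `.head()` and `.tail()`. A negative argument to `.tail()` returns all rows except the first `abs(n)`: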

test_size = 20
test, train = df.head(test_size), df.tail(-test_size)

>>> test
shape: (20, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 60  │
│ 24  │
│ 96  │
│ 94  │
│ …   │
│ 50  │
│ 54  │
│ 56  │
│ 33  │
└─────┘

>>> train
shape: (80, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 87  │
│ 38  │
│ 6   │
│ 37  │
│ …   │
│ 93  │
│ 77  │
│ 8   │
│ 23  │
└─────┘
