I am trying to find a simple way to randomly split a polars dataframe into train and test sets. This is how I am doing it right now:

train, test = (
    df.with_columns(pl.lit(np.random.rand(df.height) > 0.8).alias('split'))
    .partition_by('split')
)

However, this leaves an extra split column in both dataframes, which I then need to drop.

Dean MacGregor
ste_kwr
  • There is an open request for `.partition_by` to drop the columns: https://github.com/pola-rs/polars/issues/8808 - Could you shuffle the dataframe: `df.sample(fraction=1, shuffle=True)` and take `.head()` + `.tail()` or do you need random sized splits? – jqurious Jun 09 '23 at 21:32
  • ah shuffle would work, thanks! – ste_kwr Jun 09 '23 at 21:43
  • would you mind cleaning up and adding your comment as an answer so I can mark it – ste_kwr Jun 10 '23 at 15:14
  • Yep sure thing, I wasn't sure if it answered the question properly or not. – jqurious Jun 10 '23 at 15:28
  • https://stackoverflow.com/a/76546689/ may be a more suitable answer? – jqurious Jun 25 '23 at 00:28

1 Answer

There is an open feature request for allowing .partition_by to drop keys.

As discussed in the comments, it is possible to shuffle a dataframe using .sample():

import polars as pl

df = pl.DataFrame({"val": range(100)})

df = df.sample(fraction=1, shuffle=True)
shape: (100, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 64  │
│ 40  │
│ 39  │
│ 98  │
│ …   │
│ 21  │
│ 29  │
│ 87  │
│ 99  │
└─────┘

The shuffled dataframe can then be split into parts, e.g. using .head and .tail (a negative n in .tail(-n) returns every row except the first n):

test_size = 20
test, train = df.head(test_size), df.tail(-test_size)
>>> test
shape: (20, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 60  │
│ 24  │
│ 96  │
│ 94  │
│ …   │
│ 50  │
│ 54  │
│ 56  │
│ 33  │
└─────┘
>>> train
shape: (80, 1)
┌─────┐
│ val │
│ --- │
│ i64 │
╞═════╡
│ 87  │
│ 38  │
│ 6   │
│ 37  │
│ …   │
│ 93  │
│ 77  │
│ 8   │
│ 23  │
└─────┘
jqurious