
I am trying to use MLJ on a DataFrame (30,000 rows x 8,000 columns), but every table operation takes a very long time to compile, even though it is fast once compiled.

The example below generates a 5 x 5000 DataFrame and gets stuck on the unpack line (line 3). When I run the same code for a 5 x 5 DataFrame, line 3 outputs “2.872309 seconds (9.09 M allocations: 565.673 MiB, 6.47% gc time, 99.84% compilation time)”.

That is a lot of compilation time for a seemingly simple task, and I would like to know how to reduce it. Thank you, Jack

using MLJ

using DataFrames

[line 1] @time arr = [[rand(1:10) for i in 1:5] for i in 1:5000];

output: 0.053668 seconds (200.76 k allocations: 11.360 MiB, 22.16% gc time, 99.16% compilation time)

[line 2] @time df = DataFrames.DataFrame(arr, :auto)

output: 0.267325 seconds (733.43 k allocations: 40.071 MiB, 4.29% gc time, 98.67% compilation time)

[line 3] @time y, X = unpack(df, ==(:x1));

does not finish running

Jack N

2 Answers


It's not unexpected that the Julia compiler struggles with very wide DataFrames, which have (potentially) heterogeneous column types. That said, I'm not sure why this has to be a problem for this particular operation; I've asked the MLJ maintainers, who can hopefully chime in.

In the meantime you can simply do

y, X = df.x1, select!(df, Not(:x1))

which is instantaneous. (Note that select! drops x1 from df in place; if you want to leave df unchanged and get a copy, use select instead.)
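To make the difference between select and select! concrete, here is a minimal sketch on a small stand-in table. The 5 x 5 size and the names df2, y2 and X2 are made up for illustration; only DataFrames is needed, not MLJ.

```julia
using DataFrames

# Small stand-in for the wide table in the question: 5 rows, columns x1..x5.
df = DataFrame([rand(1:10, 5) for _ in 1:5], :auto)

# Non-mutating split: `select` returns a new DataFrame and leaves `df` alone.
y = df.x1
X = select(df, Not(:x1))

# In-place split: `select!` removes x1 from the DataFrame it is given.
df2 = copy(df)
y2 = df2.x1                    # grab the target column before it is dropped
X2 = select!(df2, Not(:x1))    # returns df2 itself, now without x1
```

After this runs, df still has all five columns, while df2 (and X2, which is the same object) has only x2 through x5.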

Nils Gudat
> Julia compiler struggles with very wide DataFrames, which have (potentially) heterogeneous column types.
That is precisely the reason DataFrame doesn't carry the column types in its own type parameters, so it's mainly how MLJ does it that causes compilation hell. – jling Jun 09 '22 at 15:06

Please don't cross-post a problem on multiple websites without linking.

The question has been answered at the Julia forum: https://discourse.julialang.org/t/simple-table-operation-has-very-large-compilation-time-with-mlj/82503/2. It was caused by a bug which is fixed in MLJBase 0.20.5.
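One way to check whether the active environment already has the fixed release is sketched below. It uses only the standard Pkg API; the helper name mljbase_version and the constant FIXED are made up for this example, and the version threshold is simply the one quoted above.

```julia
using Pkg

# Release that contains the fix, as stated in this answer.
const FIXED = v"0.20.5"

# Look up MLJBase among the packages resolved in the active environment.
# It is typically an indirect dependency pulled in by MLJ.
function mljbase_version()
    for (_, info) in Pkg.dependencies()
        info.name == "MLJBase" && return info.version
    end
    return nothing  # MLJBase is not in this environment
end

installed = mljbase_version()
if installed === nothing
    println("MLJBase not found in the active environment")
elseif installed < FIXED
    println("MLJBase $installed predates the fix; try Pkg.update(\"MLJBase\")")
else
    println("MLJBase $installed already includes the fix")
end
```

If the installed version is older, Pkg.update("MLJBase") should pull in a newer release, provided no other package in the environment pins it to an old version.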

RikH