I am trying to use MLJ on a DataFrame (30,000 rows x 8,000 columns) but every table operation seems to take a huge amount of time to compile but is fast to run.
I have given an example with code below in which a 5 x 5000 DataFrame is generated and it gets stuck on the unpack line (line 3). When I run the same code for a 5 x 5 DataFrame, line 3 outputs “2.872309 seconds (9.09 M allocations: 565.673 MiB, 6.47% gc time, 99.84% compilation time)”.
This is a crazy amount of compilation time for a seemingly simple task and I would like to know how I can reduce this. Thank you, Jack
using MLJ
using DataFrames
[line 1] @time arr = [[rand(1:10) for i in 1:5] for i in 1:5000];
output: 0.053668 seconds (200.76 k allocations: 11.360 MiB, 22.16% gc time, 99.16% compilation time)
[line 2] @time df = DataFrames.DataFrame(arr, :auto)
output: 0.267325 seconds (733.43 k allocations: 40.071 MiB, 4.29% gc time, 98.67% compilation time)
[line 3] @time y, X = unpack(df, ==(:x1));
does not finish running