I have a dataset of 6M+ rows and about 300 columns that I am currently pre-processing with dask in Python. I'm building a classifier, and there is a severe class imbalance. I would normally handle this with the sampling methods in imblearn (random oversampling, undersampling, SMOTE, etc.), but I've read that imblearn doesn't work with dask because all the data needs to be read into memory first. Is there an equivalent method that works with dask, or would I need to use other big data tools like Spark?
It looks like handling imbalanced datasets is still [a work in progress on dask-ml](https://github.com/dask/dask-ml/issues/317) – scj13 Mar 18 '22 at 00:11
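In the meantime, random undersampling at least doesn't need the whole dataset in memory, because each row can be kept or dropped independently once the class counts are known. Here is a minimal sketch of that idea in plain Python, where lists of dicts stand in for dask partitions. The helper names (`count_labels`, `undersample`), the label column `"y"`, and the toy data are all made up for illustration; none of this is a dask-ml or imblearn API. SMOTE is a different story, since it needs nearest neighbours across the data, and this sketch doesn't cover it.

```python
import random
from collections import Counter


def count_labels(chunks, label_key):
    """First pass: count rows per class, one chunk at a time."""
    counts = Counter()
    for chunk in chunks:
        counts.update(row[label_key] for row in chunk)
    return counts


def undersample(chunks, label_key, counts, seed=0):
    """Second pass: keep each row with probability min_count / class_count.

    Minority rows get probability 1.0 and are always kept; majority rows
    are thinned so the classes end up roughly (not exactly) balanced.
    """
    rng = random.Random(seed)
    n_min = min(counts.values())
    keep_prob = {cls: n_min / n for cls, n in counts.items()}
    for chunk in chunks:
        yield [row for row in chunk if rng.random() < keep_prob[row[label_key]]]


# Hypothetical toy data: 4 chunks of 1000 rows, roughly 5% positives.
data_rng = random.Random(42)
chunks = [
    [{"x": data_rng.random(), "y": int(data_rng.random() < 0.05)} for _ in range(1000)]
    for _ in range(4)
]

counts = count_labels(chunks, "y")
balanced = [row for chunk in undersample(chunks, "y", counts) for row in chunk]
balanced_counts = Counter(row["y"] for row in balanced)
print("before:", dict(counts))
print("after: ", dict(balanced_counts))
```

With a dask DataFrame, I'd expect the same two steps to look something like `value_counts().compute()` on the label column for the counts, then filtering the majority class and calling `.sample(frac=...)` on it before concatenating with the minority rows. I haven't benchmarked that on data of this size, so treat it as a starting point.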