
I like the h2o.ai tool for ML. It is Java-based, but it is familiar and does a decent job.

Here is info about stratified splitting in general:

I have a variable that is strongly imbalanced, so I need R-gui based stratified splitting of my data on that variable, in h2o.ai. Is there a way to do it?

An R command for splitting data in the h2o.ai tool is this:

splits = h2o.splitFrame(mydata, ratios=myratio, destination_frames=...)

There is no stratification option in the h2o.splitFrame function. I know that Flow (the web interface to the running Java instance) allows balanced classes in the cross-validated approach, so somewhere in there it is doing stratified splitting.

I would rather not do this in base R, because R's memory handling is not as effective as h2o.ai's and my data sets are large.

EngrStudent
  • I'm not sure you have given enough information to allow anyone to help you here. You do know there's an `h2o` package for R? – Allan Cameron Oct 02 '20 at 11:59
  • @AllanCameron - there is absolutely an h2o.ai package for r. It is called h2o, but here on SO that is ambiguous because there is another package named h2o that doesn't have anything to do with machine learning. – EngrStudent Oct 02 '20 at 12:03
  • OK - in that case, what is it you are actually asking? You seem to have posed a conceptual question about stratified splitting of your data, without any concrete example of what you mean, then asking "is there a way to do it?!" The answer is "Yes, probably!". If you want a more detailed answer than that, we probably need a more detailed question. – Allan Cameron Oct 02 '20 at 12:09
  • @AllanCameron - I need to do it in the framework. It isn't conceptual at all. The framework is specified, the task is specified, the answer isn't on the web or in the help-docs (that I can find). – EngrStudent Oct 02 '20 at 12:11
  • EngrStudent your updated question makes things much clearer and makes your question much better. I have removed my downvote and close vote. Thank you – Allan Cameron Oct 02 '20 at 12:14
  • @EngrStudent I don't think you need to apply stratified sampling before training the model given `h2o` has different parameters for `fold_assignment` (one of them being Stratified). More on this [here](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/fold_assignment.html?highlight=stratified#example) – anddt Oct 02 '20 at 12:28
  • If I'm not doing folds, because each single compute run takes waaaaay longer than I can afford, then fold assignment doesn't help. I'm doing train/valid split because I have to, not because I want to. I prefer the 5-fold cv, but I can't afford it right this second. – EngrStudent Oct 02 '20 at 12:45

1 Answer


As far as I understand, you want stratified sampling because your data is heavily imbalanced.

When creating a model, you can set certain arguments to achieve this, for example:

h2o.gbm(..., nfolds=n, fold_assignment="Stratified")

Alternatively, you can try setting:

h2o.gbm(..., balance_classes=TRUE, ...)

I hope this helps. For more details, please refer to https://docs.h2o.ai/h2o/latest-stable/h2o-r/h2o_package.pdf
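
If cross-validation is too expensive and you only need a single train/validation split, one possible workaround is to compute the split assignment in base R on the label column alone, which is a single column and stays small even when the full frame is large. The sketch below is a minimal illustration under stated assumptions, not an h2o feature: `stratified_assign` is a hypothetical helper, and the column name `"y"` in the commented h2o lines is a placeholder. The commented hookup has not been run against a cluster here.

```r
# Hypothetical helper (not part of h2o): label each row "train" or "valid"
# so that every class keeps about the same train fraction.
stratified_assign <- function(labels, train_frac = 0.8, seed = 1) {
  set.seed(seed)
  out <- rep("valid", length(labels))
  # split() groups the row indices by class; rows with NA labels stay "valid"
  for (idx in split(seq_along(labels), labels)) {
    if (length(idx) == 0) next
    n_train <- round(train_frac * length(idx))
    # sample.int() on positions avoids sample()'s special case for length-1 input
    out[idx[sample.int(length(idx), n_train)]] <- "train"
  }
  factor(out, levels = c("train", "valid"))
}

# Small demo on an imbalanced label: 90 rows of "a", 10 rows of "b"
labels <- c(rep("a", 90), rep("b", 10))
assignment <- stratified_assign(labels, train_frac = 0.8, seed = 42)
print(table(labels, assignment))  # a: 72 train / 18 valid, b: 8 train / 2 valid

# Assumed hookup to h2o (untested sketch; "y" is a placeholder column name):
# y_local  <- as.vector(mydata[, "y"])   # pull only the label column into R
# split_hf <- as.h2o(data.frame(split = stratified_assign(y_local)))
# mydata   <- h2o.cbind(mydata, split_hf)
# train    <- mydata[mydata$split == "train", ]
# valid    <- mydata[mydata$split == "valid", ]
```

Only the label vector crosses into R's memory; the heavy frame stays in the h2o cluster, and the resulting `train` and `valid` frames could then be passed as `training_frame` and `validation_frame`.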

Mathanraj-Sharma