To Whom It May Concern,

The code below is being run in a Docker container based on Jupyter's data science notebook; however, I've installed Java 8 and H2O (version 3.20.0.7) and exposed the necessary ports. The container runs on an Ubuntu 16.04 machine with 32 threads and over 300 GB of RAM. H2O is using all of the threads and 26.67 GB of memory. I'm attempting to classify text as either a 0 or a 1 using the code below.
However, despite setting `max_runtime_secs` to 900 seconds (15 minutes), the code hadn't finished executing and was still tying up most of the machine's resources ~15 hours later. As a side note, it took about 20 minutes for `df_train` to parse. Any thoughts on what's going wrong?
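
For reference, I start the cluster below with the default `h2o.init()`. Pinning the thread count and heap size explicitly would look roughly like the sketch here; `nthreads` and `max_mem_size` are standard `h2o.init()` arguments, and the values are purely illustrative:

    # Sketch only: cap the cluster's threads and heap at init time.
    # The values are illustrative, not the ones used in the run below.
    import h2o
    h2o.init(nthreads=16, max_mem_size="64G")

That said, the run below just uses the defaults.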

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

    df = pd.read_csv('Data.csv')[['Text', 'Classification']]

    # Build word n-gram (1-3) count features, dropping English stop words
    vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                                 ngram_range=(1, 3), stop_words='english')

    x_train_vec = vectorizer.fit_transform(df['Text'])
    y_train = df['Classification']

    import h2o
    from h2o.automl import H2OAutoML
    h2o.init()

    # Convert the dense document-term matrix to an H2OFrame and append the labels
    df_train = h2o.H2OFrame(x_train_vec.A, header=-1, column_names=vectorizer.get_feature_names())
    df_labels = h2o.H2OFrame(y_train.reset_index()[['Classification']])
    df_train = df_train.concat(df_labels)

    # Use every column except the response as a predictor
    x_train_cn = df_train.columns
    y_train_cn = 'Classification'
    x_train_cn.remove(y_train_cn)

    # Cast the response to a factor so AutoML treats this as classification
    df_train[y_train_cn] = df_train[y_train_cn].asfactor()

    # Run AutoML with a 15-minute budget, excluding deep learning models
    h2o_aml = H2OAutoML(max_runtime_secs=900, exclude_algos=["DeepLearning"])

    h2o_aml.train(x=x_train_cn, y=y_train_cn, training_frame=df_train)

    lb = h2o_aml.leaderboard

    # Predict on the training frame and pull the predicted labels back into pandas
    # so they can be scored with scikit-learn
    y_predict = h2o_aml.leader.predict(df_train.drop('Classification'))
    y_predict = y_predict['predict'].as_data_frame()['predict']

    print('accuracy: {}'.format(accuracy_score(y_pred=y_predict, y_true=y_train)))
    print('precision: {}'.format(precision_score(y_pred=y_predict, y_true=y_train)))
    print('recall: {}'.format(recall_score(y_pred=y_predict, y_true=y_train)))
    print('f1: {}\n'.format(f1_score(y_pred=y_predict, y_true=y_train)))
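
Note that `x_train_vec.A` densifies the entire document-term matrix before it is handed to H2O, which likely contributes to the long parse time. A rough sketch of capping the vocabulary to keep that matrix smaller; `max_features=15000` is an arbitrary illustrative value, not a tuned one:

    # Sketch: limit the vocabulary so the densified matrix stays manageable.
    # max_features=15000 is an illustrative value, not a tuned one.
    vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                                 ngram_range=(1, 3), stop_words='english',
                                 max_features=15000)
    x_train_vec = vectorizer.fit_transform(df['Text'])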
cjmobley

1 Answer

This is a bug that has been fixed on master. If you want, you can try out the fix now on the nightly release; otherwise, it will be fixed in the next stable release of H2O, 3.22.
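
A quick way to confirm which build is actually in use (a minimal sketch; `h2o.init()` also prints the cluster version in its connection banner):

    # Sketch: check the installed Python package version; the h2o.init() banner
    # reports the version/build of the cluster actually serving requests.
    import h2o
    print(h2o.__version__)
    h2o.init()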

Erin LeDell
  • Hi @Erin LeDell, I'm using version 3.21.0.4438 of h2o and I'm still running into the same problem. I set `h2o_aml = H2OAutoML(max_runtime_secs = 300)` and then execute `h2o_aml.train(x = x_train_cn , y = y_train_cn, training_frame = df_train)`. However, half an hour later the code is still running. So, I'm not sure if the bug was actually fixed. I saw https://github.com/h2oai/h2o-3/commit/8f85bf0d74ba46b4ec71d32f26cabbf5eaead245, but I wouldn't think it would take over 25 minutes to train the final model. – cjmobley Oct 03 '18 at 19:49
  • I'm working on a text classification problem. I have ~10,000 examples. I originally used a CountVectorizer because it performed slightly better than a TfidfVectorizer in testing with other automl libraries. However, a CountVectorizer results in a ~10,000x150,000 ndarray whereas a TfidfVectorizer results in a ~10,000x15,000. After switching to the TfidfVectorizer H2O completes the automl process; however, when training for 5 minutes it results in an empty leaderboard, 15 minutes results in one model on the lb. I'm using h2o version 3.21.0.4446, 80 threads, and 500 gb of RAM. – cjmobley Oct 11 '18 at 17:30
  • I am trying the Python version of 3.22.0.1, and it seems it still has the problem of not respecting `max_runtime_secs`. Does anybody have an update on the Python version? – Chandu Nov 27 '18 at 04:54
  • The `max_runtime_secs` does not currently include the Stacked Ensemble building at the end of the run, so it's possible that it can go over (we may change this in the future). If you find that there are bigger issues, please file a reproducible bug report here and include your logs: https://0xdata.atlassian.net/projects/PUBDEV – Erin LeDell Dec 03 '18 at 14:46
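
Following up on the comment above: if the overrun comes mainly from the Stacked Ensemble step at the end of the run, one workaround (a sketch, not an official recommendation; the values are illustrative) is to exclude the ensembles and/or cap the number of models:

    # Sketch: keep the run closer to the time budget by skipping Stacked Ensembles
    # and/or capping the number of models built. Values are illustrative.
    from h2o.automl import H2OAutoML
    h2o_aml = H2OAutoML(max_runtime_secs=900,
                        max_models=10,
                        exclude_algos=["DeepLearning", "StackedEnsemble"])
    h2o_aml.train(x=x_train_cn, y=y_train_cn, training_frame=df_train)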