To Whom It May Concern,
The code below runs in a Docker container based on Jupyter's data science notebook image;
however, I've installed Java 8 and h2o (version 3.20.0.7) and exposed the necessary ports. The container runs on an Ubuntu 16.04 host with 32 threads and over 300 GB of RAM.
h2o is using all the threads and 26.67 GB of memory. I'm attempting to classify text as either a 0 or a 1 using the code below.
However, despite setting max_runtime_secs to 900 (15 minutes), the code hadn't finished executing and was still tying up most of the machine's resources ~15 hours later. As a side note, df_train took about 20 minutes to parse. Any thoughts on what's going wrong?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv('Data.csv')[['Text', 'Classification']]
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 3), stop_words='english')
x_train_vec = vectorizer.fit_transform(df['Text'])
y_train = df['Classification']
import h2o
from h2o.automl import H2OAutoML
h2o.init()
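# .toarray() (same result as .A) materializes the full dense matrix before it is uploaded to h2o
df_train = h2o.H2OFrame(x_train_vec.toarray(), header=-1, column_names=vectorizer.get_feature_names())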
df_labels = h2o.H2OFrame(y_train.reset_index()[['Classification']])
df_train = df_train.concat(df_labels)
x_train_cn = df_train.columns
y_train_cn = 'Classification'
x_train_cn.remove(y_train_cn)
df_train[y_train_cn] = df_train[y_train_cn].asfactor()
h2o_aml = H2OAutoML(max_runtime_secs=900, exclude_algos=["DeepLearning"])
h2o_aml.train(x=x_train_cn, y=y_train_cn, training_frame=df_train)
lb = h2o_aml.leaderboard
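# predict() returns an H2OFrame (predict, p0, p1); pull the label column into pandas for sklearn
y_predict = h2o_aml.leader.predict(df_train.drop('Classification'))
y_predict = y_predict['predict'].as_data_frame()['predict']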
print('accuracy: {}'.format(accuracy_score(y_pred=y_predict, y_true=y_train)))
print('precision: {}'.format(precision_score(y_pred=y_predict, y_true=y_train)))
print('recall: {}'.format(recall_score(y_pred=y_predict, y_true=y_train)))
print('f1: {}\n'.format(f1_score(y_pred=y_predict, y_true=y_train)))
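
In case it helps with diagnosis, here is a minimal sketch for estimating how large the dense matrix handed to h2o.H2OFrame would be. The row and column counts are hypothetical placeholders (the real ones would come from x_train_vec.shape), and 8 bytes per value assumes CountVectorizer's default int64 counts.

```python
# Rough size estimate for a dense matrix.
# All counts below are hypothetical placeholders, not measurements from Data.csv.

def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Return the size in GiB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value / 1024 ** 3


# Placeholder shape; in the code above the real values are x_train_vec.shape.
n_rows, n_cols = 50_000, 200_000

# Memory figure reported for the h2o cluster in the question.
h2o_memory_gb = 26.67

size = dense_size_gib(n_rows, n_cols)
print("estimated dense matrix size: {:.1f} GiB".format(size))
print("larger than h2o's memory: {}".format(size > h2o_memory_gb))
```

With word 1- to 3-grams the column count can grow quickly, so comparing this estimate against the memory h2o reports may show whether the frame itself is the bottleneck.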