
I recently started exploring the field of Data Engineering and have run into some difficulties. I have a GCS bucket with millions of Parquet files and I want to build an Anomaly Detection model from them. My plan was to ingest that data into Databricks with Auto Loader, store it all in a Delta table, and then use H2O Sparkling Water for the Anomaly Detection model.

I have run into memory issues when I read the Delta table into a PySpark DataFrame and try to convert it to an H2O Frame.

I'm using a cluster with 30 GB of memory and 4 cores, and the DataFrame has shape (30448674, 213). The error I'm getting is the following:

The spark driver has stopped unexpectedly and is restarting. Your notebook will be reattached automatically.

Looking at the Ganglia UI, the driver's memory graph is always in the red while the asH2OFrame() call is executing.
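For what it's worth, a back-of-envelope estimate (assuming roughly 8 bytes per cell, i.e. 64-bit numeric columns, and ignoring any compression or string overhead) suggests the frame alone is far larger than the cluster's 30 GB:

```python
# Rough estimate of the in-memory size of the (30448674, 213) frame,
# assuming ~8 bytes per cell (64-bit numeric values).
rows, cols = 30_448_674, 213
bytes_per_cell = 8  # assumption: double-precision / long columns

total_gb = rows * cols * bytes_per_cell / 1e9
print(f"Approximate frame size: {total_gb:.0f} GB")  # ≈ 52 GB
```

So even before the conversion, the full dataset does not fit in 30 GB of memory, which would explain the driver dying during asH2OFrame().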

This is the code I am using:

# Imports
import h2o
from pysparkling import *
from pysparkling.ml import H2OIsolationForestEstimator
from pyspark.sql import SparkSession

# Create an H2OContext object
hc = H2OContext.getOrCreate(spark)

# Read the data from the Delta table into a PySpark DataFrame
df = spark.read.format("delta").load("path/to/delta_table")

# Convert the PySpark DataFrame to an H2OFrame
h2o_df = hc.asH2OFrame(df)

# H2O model for Anomaly Detection
isolation_model = H2OIsolationForestEstimator(model_id="isolation_forest",
                                              ntrees=100,
                                              seed=42)

isolation_model.train(training_frame=h2o_df)

# Make predictions
predictions = isolation_model.predict(h2o_df)
