
Is there any way to batch a large input file (111 MB, about 22 million cells: 222 rows by 110k columns) in MLlib, similar to this Keras batching tutorial? The file contains the features extracted from 222 images following that tutorial, but instead of using a Keras model I would like to replicate the code using PySpark and MLlib.

Unfortunately I don't have enough resources to process such a big file in memory, and the computation fails with a Java heap space error.

The file structure is as follows: each row represents an image; column "_c0" holds the 0/1 label, and columns "_c1" through "_c100353" hold the extracted features.
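To make the layout concrete, here is a plain-Python sketch of how one CSV row maps to a label and a feature vector (the function name and sample values are illustrative, not part of my pipeline):

```python
# Sketch of the row layout: first value is the 0/1 label ("_c0"),
# the remaining values are the extracted features ("_c1" .. "_c100353").
def parse_row(line):
    values = line.split(",")
    label = int(values[0])                      # "_c0": 0/1 label
    features = [float(v) for v in values[1:]]   # feature columns
    return label, features

label, features = parse_row("1,0.5,0.25,0.125")
# label == 1, features == [0.5, 0.25, 0.125]
```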

Here's my code. I don't care about precision or accuracy; I'm only interested in running the model to collect resource-usage metrics.

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

sql, sc = init_spark()

df = (sql.read
      .option("maxColumns", 100400)
      .load(file3, format="csv", inferSchema="true", sep=",", header="false"))

# Index the 0/1 label column
labelIndexer = StringIndexer(inputCol="_c0", outputCol="indexedLabel").fit(df)

# Assemble all feature columns into a single vector column
cols = df.columns
cols.remove("_c0")
assembler = VectorAssembler(inputCols=cols, outputCol="features")
data = assembler.transform(df)
featureIndexer = VectorIndexer(
    inputCol="features", outputCol="indexedFeatures", maxCategories=100).fit(data)

(trainingData, testData) = data.randomSplit([0.7, 0.3])
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model; this also runs the indexers
model = pipeline.fit(trainingData)

# Make predictions
predictions = model.transform(testData)

# Select example rows to display
predictions.select("prediction", "indexedLabel", "features").show(100)
predictions.printSchema()

evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g" % accuracy)
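I haven't shown init_spark(); since the failure is a Java heap space error, one direction I could try is raising the driver/executor memory limits when building the session. A configuration sketch (the memory values are guesses I haven't verified, not a confirmed fix):

```python
# Hypothetical session setup with larger JVM heaps (values are placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mllib-batch-test")
    .config("spark.driver.memory", "8g")         # driver JVM heap
    .config("spark.driver.maxResultSize", "4g")  # cap on results sent to the driver
    .config("spark.executor.memory", "8g")       # heap per executor
    .getOrCreate()
)
```

These properties only take effect when set before the JVM starts (e.g. via spark-submit or at session creation), which is why they would belong inside init_spark() rather than after it.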

Please don't suggest the sparkdl library's DeepImageFeaturizer for feature extraction, because it's completely broken.
