
I am running my ALS program in PySpark on a Spark cluster of two nodes. It works fine for 20 iterations if I disable checkpointInterval in the ALS params, but for more than 20 iterations checkpointInterval needs to be enabled. I have also set a checkpoint directory. With checkpointing enabled it gives me the following error, and I do not understand what is causing it.

The same program worked fine on a single machine with 25 iterations.
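To be clear about what I mean by enabling/disabling checkpointing, this is a minimal sketch of the ALS parameter involved (illustrative values only, separate from my actual code below; the interval only takes effect once a checkpoint directory is set on the SparkContext):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("checkpoint-interval-sketch").getOrCreate()

# checkpointInterval=5 checkpoints the intermediate factor RDDs every 5 iterations,
# which keeps the lineage short on long runs; checkpointInterval=-1 disables it.
als_enabled = ALS(maxIter=25, checkpointInterval=5)
als_disabled = ALS(maxIter=25, checkpointInterval=-1)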

My error is:

 Py4JJavaError: An error occurred while calling o2574.fit.
 : org.apache.spark.SparkException: Checkpoint RDD has a different number of
 partitions from original RDD. Original RDD [ID: 3265, num of partitions: 10];
 Checkpoint RDD [ID: 3266, num of partitions: 0].

My code is:

import time
start = time.time()

from pyspark.sql import SparkSession

spark=SparkSession.builder.master('spark://172.16.12.200:7077') \
    .appName('new').getOrCreate()

ndf = spark.read.json("Musical_Instruments_5.json")
pd=ndf.select(ndf['asin'],ndf['overall'],ndf['reviewerID'])

# Driver-local path I am currently using for checkpoints
spark.sparkContext.setCheckpointDir("/home/npproject/jupyter_files/checkpoints")

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit,ParamGridBuilder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

indexer = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in list(set(pd.columns)-set(['overall'])) ]

pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(pd).transform(pd)
(training,test)=transformed.randomSplit([0.8, 0.2])
als=ALS(maxIter=25,regParam=0.09,rank=25,userCol="reviewerID_index",
        itemCol="asin_index",ratingCol="overall",
        checkpointInterval=5,coldStartStrategy="drop",nonnegative=True)
model=als.fit(training)
evaluator=RegressionEvaluator(metricName="rmse",
labelCol="overall",predictionCol="prediction")
predictions=model.transform(test)
rmse=evaluator.evaluate(predictions)
print("RMSE="+str(rmse))
print("Rank: ",model.rank)
print("MaxIter: ",model._java_obj.parent().getMaxIter())
print("RegParam: ",model._java_obj.parent().getRegParam())

user_recs=model.recommendForAllUsers(10).show(20)

end = time.time()
print("execution time",end-start)
  • Possible duplicate of [Checkpoint RDD ReliableCheckpointRDD has different number of partitions from original RDD](https://stackoverflow.com/questions/33238882/checkpoint-rdd-reliablecheckpointrdd-has-different-number-of-partitions-from-ori) – 10465355 Feb 07 '19 at 11:46
  • Hi, I have tried to mount a directory for all worker nodes, and when I give the mounted checkpoint directory path in my program it shows as an invalid checkpoint directory. Can you tell me the correct way to mount it for a cluster? – Neha patel Feb 08 '19 at 11:04

0 Answers