
I'm working through this PySpark machine learning tutorial and ran into an issue at the section "Correlations and Data Preparation".

I was trying to run this code:

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction

binary_map = {'Yes':1.0, 'No':0.0, 'True':1.0, 'False':0.0}
toNum = UserDefinedFunction(lambda k: binary_map[k], DoubleType())

CV_data = CV_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge') \
    .withColumn('Churn', toNum(CV_data['Churn'])) \
    .withColumn('International plan', toNum(CV_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache()


final_test_data = final_test_data.drop('State').drop('Area code') \
    .drop('Total day charge').drop('Total eve charge') \
    .drop('Total night charge').drop('Total intl charge') \
    .withColumn('Churn', toNum(final_test_data['Churn'])) \
    .withColumn('International plan', toNum(final_test_data['International plan'])) \
    .withColumn('Voice mail plan', toNum(final_test_data['Voice mail plan'])).cache()

This is part of the error message printed on the terminal:

17/06/20 17:58:53 WARN BlockManager: Putting block rdd_38_0 failed due to an exception
17/06/20 17:58:53 WARN BlockManager: Block rdd_38_0 could not be removed as it was not found on disk or in memory
17/06/20 17:58:53 WARN BlockManager: Putting block rdd_53_0 failed due to an exception
17/06/20 17:58:53 WARN BlockManager: Block rdd_53_0 could not be removed as it was not found on disk or in memory
17/06/20 17:58:53 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 16)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 106, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/home/main/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<stdin>", line 1, in <lambda>
KeyError: False

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    ....

The rest of the error message can be viewed in this document.

Does anyone know what the issue is?

Thanks in advance.


1 Answer

[Resolved]

I solved it after consulting this thread from two months earlier.

The main issue, as @user6910411 mentioned above, was a data type error.
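A minimal sketch of why the lookup fails, assuming the Churn column was read as a boolean rather than a string (which is what `KeyError: False` in the traceback suggests). It uses plain Python with no Spark session, and the `lookup` helper is a hypothetical stand-in for the lambda passed to the UDF:

```python
# Same mapping as in the question: every key is a string.
binary_map = {'Yes': 1.0, 'No': 0.0, 'True': 1.0, 'False': 0.0}

def lookup(value):
    # Stand-in for the lambda wrapped by UserDefinedFunction.
    return binary_map[value]

# The string 'False' is a key, so this lookup succeeds.
string_result = lookup('False')

# The boolean False is not equal to any string key, so this raises KeyError.
try:
    lookup(False)
    error_key = None
except KeyError as exc:
    error_key = exc.args[0]

print(string_result, repr(error_key))
```

If the column really is boolean, every row fails this way as soon as the UDF is evaluated.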

Since I didn't need all the data printed as numbers, I removed the last three lines of the tutorial's code for both CV_data and final_test_data:

Excluded from CV_data:

.withColumn('Churn', toNum(CV_data['Churn'])) \
.withColumn('International plan', toNum(CV_data['International plan'])) \
.withColumn('Voice mail plan', toNum(CV_data['Voice mail plan'])).cache()

Excluded from final_test_data:

.withColumn('Churn', toNum(final_test_data['Churn'])) \
.withColumn('International plan', toNum(final_test_data['International plan'])) \
.withColumn('Voice mail plan', toNum(final_test_data['Voice mail plan'])).cache()

With those lines removed, the table printed:

>>> pd.DataFrame(CV_data.take(5), columns=CV_data.columns).transpose()
17/06/21 13:49:54 WARN Executor: 1 block locks were not released by TID = 11:
[rdd_16_0]
                            0      1      2      3      4
Account length            128    107    137     84     75
International plan         No     No     No    Yes    Yes
Voice mail plan           Yes    Yes     No     No     No
Number vmail messages      25     26      0      0      0
Total day minutes       265.1  161.6  243.4  299.4  166.7
Total day calls           110    123    114     71    113
Total eve minutes       197.4  195.5  121.2   61.9  148.3
Total eve calls            99    103    110     88    122
Total night minutes     244.7  254.4  162.6  196.9  186.9
Total night calls          91    103    104     89    121
Total intl minutes         10   13.7   12.2    6.6   10.1
Total intl calls            3      3      5      7      3
Customer service calls      1      1      0      2      3
Churn                   False  False  False  False  False
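For anyone who does need the numeric columns, one possible alternative to dropping the conversion is a mapper that accepts both strings and booleans. This is only a sketch: `to_num` and `BINARY_MAP` are hypothetical names, and it has not been run against the tutorial's dataset.

```python
# Lower-cased string keys; booleans are normalized through str() first.
BINARY_MAP = {'yes': 1.0, 'no': 0.0, 'true': 1.0, 'false': 0.0}

def to_num(value):
    """Map Yes/No strings and True/False booleans to 1.0/0.0.

    Returns None for nulls and for values that are not in the map.
    """
    if value is None:
        return None
    # str(False) is 'False', so booleans and strings share one lookup path.
    return BINARY_MAP.get(str(value).strip().lower())
```

In PySpark this function could presumably be wrapped with `udf(to_num, DoubleType())` from `pyspark.sql.functions` and used in the same `withColumn` calls as `toNum` in the question.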