I'm trying to remove only the purely numeric words from my words array, but the function I created is not working correctly. When I try to view the resulting DataFrame, the error shown below appears.
First, I converted my string into word tokens:
from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(
    inputCol="description",
    outputCol="words_withnumber",
    pattern="\\W"
)
data = regexTokenizer.transform(data)
Then I created a function to remove only the numbers:
from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType
def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False
is_digit_udf = udf(is_digit, BooleanType())
Then I call the function:
data = data.withColumn(
    'words_withoutnumber',
    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
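The traceback above is cut off, so the exact Python error is not visible. One plausible cause, assuming `words_withnumber` is an array column (which is what `RegexTokenizer` produces), is that the UDF receives the whole list of tokens rather than a single string, and a Python list has no `isdigit` method. The plain-Python sketch below reproduces that failure without Spark; the `tokens` list is made-up sample data.

```python
def is_digit(value):
    # Same logic as the UDF body in the question
    if value:
        return value.isdigit()
    else:
        return False


# A UDF applied to an array column is handed the whole row value as a list
tokens = ["short", "sarja", "40567"]

try:
    is_digit(tokens)
    error = None
except AttributeError as exc:
    # A list has no .isdigit(), so the call fails here
    error = str(exc)

print(error)
```

If this is what happens on the executors, the check has to be applied to each element of the array rather than to the array itself.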
Sample Dataframe:
+-----------+--------------------------------------------------------------+
|categoryid |description                                                   |
+-----------+--------------------------------------------------------------+
|      33004|["short", "sarja", "40567", "detalhe", "couro"]               |
|      22033|["multipane", "6768686868686867868888", "220v", "branco"]     |
+-----------+--------------------------------------------------------------+
Expected result:
+-----------+--------------------------------------------------------------+
|categoryid |description                                                   |
+-----------+--------------------------------------------------------------+
|      33004|["short", "sarja", "detalhe", "couro"]                        |
|      22033|["multipane", "220v", "branco"]                               |
+-----------+--------------------------------------------------------------+
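A minimal sketch of one way to get the expected result, assuming the column holds an array of strings: filter the tokens inside the UDF and return an array, instead of returning a single boolean for the whole row. The names `remove_numeric_tokens` and `add_words_without_number` are hypothetical helpers introduced only for this illustration. The pyspark imports sit inside the second function so that the filtering logic can be checked without a Spark session.

```python
def remove_numeric_tokens(words):
    """Return the tokens that are not made up entirely of digits."""
    if words is None:
        # Keep nulls as nulls instead of failing on them
        return None
    return [w for w in words if not w.isdigit()]


def add_words_without_number(data,
                             input_col="words_withnumber",
                             output_col="words_withoutnumber"):
    """Add a column with the purely numeric tokens removed (hypothetical helper)."""
    # Imported here so remove_numeric_tokens can be used without Spark installed
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    # The UDF returns an array of strings, not a boolean
    remove_numeric_udf = udf(remove_numeric_tokens, ArrayType(StringType()))
    return data.withColumn(output_col, remove_numeric_udf(data[input_col]))


# Quick check of the filtering logic on the sample rows
print(remove_numeric_tokens(["short", "sarja", "40567", "detalhe", "couro"]))
print(remove_numeric_tokens(["multipane", "6768686868686867868888", "220v", "branco"]))
```

With a DataFrame like the one in the question, this would be called as `data = add_words_without_number(data)`. Mixed tokens such as `220v` are kept because `str.isdigit()` is only true when every character is a digit. On Spark 2.4 or later, the built-in SQL higher-order function `filter` may be an alternative that avoids a Python UDF.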