
I'm trying to remove only the words that are numeric from my words array, but the function I created is not working correctly. When I try to view the result in my dataframe, the following error message appears.

First I converted my string into word tokens:

from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(
    inputCol="description",
    outputCol="words_withnumber",
    pattern="\\W"
)

data = regexTokenizer.transform(data)

Then I created a function to detect the numbers:

from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

Call the function:

data = data.withColumn(
    'words_withoutnumber', 
    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Sample Dataframe:

+-----------+--------------------------------------------------------------+
|categoryid |description                                                   |
+-----------+--------------------------------------------------------------+
|      33004|["short", "sarja", "40567", "detalhe", "couro"]               | 
|      22033|["multipane", "6768686868686867868888", "220v", "branco"]     | 
+-----------+--------------------------------------------------------------+

Expected result:

+-----------+--------------------------------------------------------------+
|categoryid |description                                                   |
+-----------+--------------------------------------------------------------+
|      33004|["short", "sarja", "detalhe", "couro"]                        | 
|      22033|["multipane", "220v", "branco"]                               |
+-----------+--------------------------------------------------------------+

  • Your udf expects a string, but you are passing an array to it. Also, in your sample dataframe, should `description` be `words_withnumber`? – Psidom Dec 27 '18 at 22:01
  • You need to [iterate over the array](https://stackoverflow.com/a/53486896/5858851) and filter out the unwanted words. What version of spark are you using? – pault Dec 27 '18 at 22:05
  • @Psidom, I tried to loop through the array but I got the following error message: "name 'ArrayType' is not defined" – user3661384 Dec 28 '18 at 12:22
  • @pault Spark version is 2.3.1. I saw the link and tried `filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType()))`, but I receive the error "name 'ArrayType' is not defined" – user3661384 Dec 28 '18 at 12:44
  • `from pyspark.sql.types import ArrayType` – pault Dec 28 '18 at 13:00
  • Possible duplicate of [TypeError: Column is not iterable - How to iterate over ArrayType()?](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) – pault Dec 28 '18 at 16:41

4 Answers


With help from @pault, the solution was this.

from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

Call the function:

from pyspark.sql.functions import col
from pyspark.sql.types import ArrayType, StringType

filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))

data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))
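
A quick check of the result (a minimal sketch, using the column names from the code above):

# Compare the original tokens with the cleaned tokens
data.select('words_withnumber', 'words_clean').show(truncate=False)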
– user3661384

If you want to avoid udf() for performance reasons, and if a comma will not appear in your "description" values, then the Scala solution below would work. The df.withColumn() call should be similar in PySpark.

Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

scala> df.show(false)
+----------+-------------------------------------------------+
|categoryid|description                                      |
+----------+-------------------------------------------------+
|33004     |[short, sarja, 40567, detalhe, couro]            |
|22033     |[multipane, 6768686868686867868888, 220v, branco]|
|33033     |[0123, x220, 220v, 889]                          |
+----------+-------------------------------------------------+


scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""\b\d+\b""",""),"""^,|,$""",""),",,",","),",")).show(false)
+----------+-------------------------------------------------+------------------------------+
|categoryid|description                                      |newc                          |
+----------+-------------------------------------------------+------------------------------+
|33004     |[short, sarja, 40567, detalhe, couro]            |[short, sarja, detalhe, couro]|
|22033     |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco]     |
|33033     |[0123, x220, 220v, 889]                          |[x220, 220v]                  |
+----------+-------------------------------------------------+------------------------------+


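For reference, a rough PySpark translation of the same no-UDF approach (a sketch only, assuming the same df and column names as in the Scala snippet above):

from pyspark.sql.functions import concat_ws, regexp_replace, split

# Join tokens with commas, strip purely numeric tokens, tidy stray commas, then split back into an array
df.withColumn(
    "newc",
    split(
        regexp_replace(
            regexp_replace(
                regexp_replace(concat_ws(",", "description"), r"\b\d+\b", ""),
                "^,|,$", ""),
            ",,", ","),
        ",")
).show(truncate=False)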

Spark 2.4 answer

Using spark-sql in version 2.4 onwards, you can use the filter() higher-order function to get the results:

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

scala> df.createOrReplaceTempView("tab")

scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
+----------+------------------------------+
|categoryid|fw                            |
+----------+------------------------------+
|33004     |[short, sarja, detalhe, couro]|
|22033     |[multipane, 220v, branco]     |
|33033     |[x220, 220v]                  |
+----------+------------------------------+


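The same higher-order function can be used from PySpark through expr() (a sketch, again assuming the df defined above):

from pyspark.sql.functions import expr

# lower(x) != upper(x) is false for purely numeric tokens, so they are dropped
df.select("categoryid", expr("filter(description, x -> lower(x) != upper(x))").alias("fw")).show(truncate=False)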
– stack0114106

Now that you mentioned pyspark, let's include Python.

DataFrame

data = [(33004, ['short', 'sarja', '40567', 'detalhe', 'couro']),
        (22033, ['multipane', '6768686868686867868888', '220v', 'branco']),
        (33033, ['123', 'x220', '220v', '889'])]

df = spark.createDataFrame(data, ('categoryid', 'description'))

Code

from pyspark.sql.functions import expr

# Keep only words that contain at least one letter
df.withColumn("description_filtered", expr("filter(description, x -> x rlike '([a-z]+)')")).show()

Outcome

+----------+--------------------+--------------------+
|categoryid|         description|description_filtered|
+----------+--------------------+--------------------+
|     33004|[short, sarja, 40...|[short, sarja, de...|
|     22033|[multipane, 67686...|[multipane, 220v,...|
|     33033|[123, x220, 220v,...|        [x220, 220v]|
+----------+--------------------+--------------------+

# Keep only words with at least two consecutive non-digit characters
df.withColumn("description_filtered", expr("filter(description, x -> x rlike '([^0-9]{2})')")).show()

Outcome

+----------+--------------------+--------------------+
|categoryid|         description|description_filtered|
+----------+--------------------+--------------------+
|     33004|[short, sarja, 40...|[short, sarja, de...|
|     22033|[multipane, 67686...| [multipane, branco]|
|     33033|[123, x220, 220v,...|                  []|
+----------+--------------------+--------------------+
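
If the intent is strictly alphabetical tokens (no digits at all), a stricter pattern would be needed; the variant below is an assumption on my part, not part of the original answer:

# Hypothetical stricter filter: keep only tokens containing no digits at all
df.withColumn("description_filtered", expr("filter(description, x -> x rlike '^[^0-9]+$')")).show()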
– wwnde

Spark 3.2+ has try_divide. If the attempt is unsuccessful (which happens when the provided value is not numeric or is null), the function returns null. We can use it together with filter to remove numeric values from the array.

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('33004', ["short", "sarja",  "40567", "detalhe", "couro"]),
     ('22033', ["multipane", "6768686868686867868888", "220v", "branco"]),],
    ['categoryid', 'description']
)
df = df.withColumn(
    'description',
    F.expr("filter(description, x -> try_divide(x, 1) is null)")
)
df.show(truncate=0)
# +----------+------------------------------+
# |categoryid|description                   |
# +----------+------------------------------+
# |33004     |[short, sarja, detalhe, couro]|
# |22033     |[multipane, 220v, branco]     |
# +----------+------------------------------+
– ZygD