
I am using Spark 2.4.5 and I need to calculate the sentiment score of a token-list column (MeaningfulWords) in df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of scores for the tokens, and another column with the mean score (sum of scores / word count) of each record. If any token in the list (df1) is not in the dictionary (df2), it scores zero.

The DataFrames look like this:

df1.select("ID","MeaningfulWords").show(truncate=True, n=5)
+------------------+------------------------------+
|                ID|               MeaningfulWords|
+------------------+------------------------------+
|abcde00000qMQ00001|[casa, alejado, buen, gusto...|
|abcde00000qMq00002|[clientes, contentos, servi...|
|abcde00000qMQ00003|                 [resto, bien]|
|abcde00000qMQ00004|[mal, servicio, no, antiend...|
|abcde00000qMq00005|[gestion, adecuada, proble ...|
+------------------+------------------------------+

df2.show(5)
+-----+----------+
|score|      word|
+-----+----------+
| 1.68|abandonado|
| 3.18|    abejas|
|  2.8|    aborto|
| 2.46| abrasador|
| 8.13|    abrazo|
+-----+----------+
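
For reproducibility, df2 can be built from the sample rows shown above; a minimal sketch (assuming a `spark` session is already available):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sentiment dictionary: one (score, word) row per dictionary entry.
    df2 = spark.createDataFrame(
        [(1.68, "abandonado"), (3.18, "abejas"), (2.8, "aborto"),
         (2.46, "abrasador"), (8.13, "abrazo")],
        ["score", "word"])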

The new columns to add to df1 should look like this:

+------------------+---------------------+
|         MeanScore|            ScoreList|
+------------------+---------------------+
|              2.95|[3.10, 2.50, 1.28,...|
|              2.15|[1.15, 3.50, 2.75,...|
|              2.75|[4.20, 1.00, 1.75,...|
|              3.25|[3.25, 2.50, 3.20,...|
|              3.15|[2.20, 3.10, 1.28,...|
+------------------+---------------------+

I have reviewed some options using .join, but joining on columns with different data types gives an error. I have also tried converting the DataFrames to RDDs and calling a function:

def map_words_to_values(review_words, dict_df):
    return [dict_df[word] for word in review_words if word in dict_df]

RDD1 = swRemoved.rdd.map(list)
RDD2 = Dict_df.rdd.map(list)

reviewsRDD_dict_values = RDD1.map(lambda row: (row[0], map_words_to_values(row[1], RDD2)))
reviewsRDD_dict_values.take(3)

But with this option I get the error:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
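
For reference, the usual workaround for SPARK-5063 is to collect the (small) dictionary to the driver and broadcast it as a plain Python dict, instead of referencing a second RDD inside the map. A sketch, assuming `swRemoved` has the ID and MeaningfulWords columns, `Dict_df` has score and word, and a `spark` session is in scope:

    # Collect the small dictionary to the driver, then broadcast it so every
    # executor gets a read-only copy instead of a reference to another RDD.
    word_scores = {row["word"]: row["score"] for row in Dict_df.collect()}
    bc_scores = spark.sparkContext.broadcast(word_scores)

    def map_words_to_values(review_words):
        # Look up each token; tokens missing from the dictionary are skipped here.
        return [bc_scores.value[w] for w in review_words if w in bc_scores.value]

    reviewsRDD_dict_values = swRemoved.rdd.map(
        lambda row: (row["ID"], map_words_to_values(row["MeaningfulWords"])))
    reviewsRDD_dict_values.take(3)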

I have found some examples that score text using the afinn library, but it doesn't work with Spanish text.

I want to use native PySpark functions instead of UDFs to avoid hurting performance, if possible. But I'm a beginner in Spark and I would like to find the Spark way to do this.


1 Answer


You could do this by first joining with an array_contains(MeaningfulWords, word) condition, then grouping by ID with the aggregations first, collect_list, and mean (Spark 2.4+).


df1.show()

#+------------------+----------------------------+
#|ID                |MeaningfulWords             |
#+------------------+----------------------------+
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|
#|abcde00000qMq00002|[clientes, contentos, servi]|
#|abcde00000qMQ00003|[resto, bien]               |
#+------------------+----------------------------+

df2.show()

#+-----+---------+
#|score|     word|
#+-----+---------+
#| 1.68|     casa|
#|  2.8|  alejado|
#| 1.03|     buen|
#| 3.68|    gusto|
#| 0.68| clientes|
#|  2.1|contentos|
#| 2.68|    servi|
#| 1.18|    resto|
#| 1.98|     bien|
#+-----+---------+


from pyspark.sql import functions as F
df1.join(df2, F.expr("array_contains(MeaningfulWords, word)"), 'left')\
   .groupBy("ID")\
   .agg(F.first("MeaningfulWords").alias("MeaningfulWords"),
        F.collect_list("score").alias("ScoreList"),
        F.mean("score").alias("MeanScore"))\
   .show(truncate=False)

#+------------------+----------------------------+-----------------------+------------------+
#|ID                |MeaningfulWords             |ScoreList              |MeanScore         |
#+------------------+----------------------------+-----------------------+------------------+
#|abcde00000qMQ00003|[resto, bien]               |[1.18, 1.98]           |1.58              |
#|abcde00000qMq00002|[clientes, contentos, servi]|[0.68, 2.1, 2.68]      |1.8200000000000003|
#|abcde00000qMQ00001|[casa, alejado, buen, gusto]|[1.68, 2.8, 1.03, 3.68]|2.2975            |
#+------------------+----------------------------+-----------------------+------------------+
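
One caveat: with this join, tokens that are missing from the dictionary simply produce no matching row, so MeanScore above divides by the number of matched words. To get the question's "unknown tokens score zero" behaviour (sum of scores / count of all tokens), divide the score sum by the token count instead; a sketch building on the same join:

    from pyspark.sql import functions as F

    (df1.join(df2, F.expr("array_contains(MeaningfulWords, word)"), "left")
        .groupBy("ID")
        .agg(F.first("MeaningfulWords").alias("MeaningfulWords"),
             F.collect_list("score").alias("ScoreList"),
             F.sum("score").alias("ScoreSum"))
        # Unmatched tokens add nothing to the sum but still count in the
        # denominator, matching the question's definition of the mean.
        .withColumn("MeanScore",
                    F.coalesce(F.col("ScoreSum"), F.lit(0.0)) / F.size("MeaningfulWords"))
        .drop("ScoreSum")
        .show(truncate=False))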
  • @reply your solution is great! It perfectly solves my problem. I have added a `,how = 'left'` to the `join`, to keep all df1 rows. For me, thinking the Spark way instead of the pandas way is still a bit difficult. – EJS May 09 '20 at 10:29
  • Hi @murtihash, do you know why the result loses columns of df1? I'm trying to add `.select(df1.col('year'), df1.col('Meaningfulwords'), df1.col('ID'))` before `.groupBy`, but it doesn't show me the 'year' column. Would this be a new question on SO? Thanks in advance for your help! – EJS May 25 '20 at 14:26
  • Wrap `year` in the `.agg` as a `.first`, just like you did with MeaningfulWords. – murtihash May 25 '20 at 16:52
  • Thanks a lot. I'm trying this option but I get a `java.lang.NullPointerException`. I'm not sure what happened. – EJS May 25 '20 at 17:17
  • I'd recommend opening a new question with more info on your task. – murtihash May 25 '20 at 17:29
  • Thanks @murtihash, I opened a new question: [62008988](https://stackoverflow.com/questions/62008988/error-including-a-column-in-a-join-between-spark-dataframes) – EJS May 25 '20 at 19:04
  • @Eli-js, I edited my solution for this one; using a higher-order function was overkill here, and we can compute the mean using `.agg` only. – murtihash May 25 '20 at 20:12
  • Thanks a lot! I already updated the code. I repeated the tests to include 'Year' in [62008988](https://stackoverflow.com/questions/62008988/error-including-a-column-in-a-join-between-spark-dataframes), but that is still failing. – EJS May 25 '20 at 21:12