I have the following dictionaries inside variables:

sk_channel_types = {"facebooknotification": 2,
                    "facebookmessenger": 9,
                    "onsitenotification": 3,
                    "pushnotification": 6,
                    "pushnotificationmessage": 6,
                    "lightbox": 4,
                    "onsitemessage": 7,
                    "mailmessage": 1}

sk_story_types = {"welcome": 7,
                  "rescue": 13,
                  "frequency": 4,
                  "abandoncart": 6,
                  "pricedrop": 16,
                  "manual": 5,
                  "searchbykeyword": 30,
                  "sazonality": 31,
                  "bestdayforpurchase": 28,
                  "pricechange": 32,
                  "availability": 33,
                  "toptrending": 1,
                  "toptrendingbycluster": 2,
                  "toptrendingwithpricelimit": 3,
                  "frequencyview": 4,
                  "manualnotification": 5,
                  "trending": 9,
                  "toptrendingbykeyword": 9}

And this is my current spark dataframe:

ID StoryType Type StoryId
abcdefghijklmnopqrst AbandonCart MailMessage 56465465456456456465
lçdkçlskdçlsdkçlskdç ManualNotification MailMessage 60983099380938390833
uahuahuahauhauahuaha ManualNotification MailMessage 49438093890484984949
sklçskçlskdkcnopeieo ManualNotification MailMessage 93084098409840984098
2d5fe941380938098948 ManualNotification MailMessage 49809380398094894844
9883jkjd3eu0dj0j3930 ManualNotification MailMessage 636f50c9380938093893

I need to replace the StoryType and Type columns with their respective numbers, as per the variables, like this:

ID StoryType Type StoryId
abcdefghijklmnopqrst 6 1 56465465456456456465
lçdkçlskdçlsdkçlskdç 5 1 60983099380938390833
uahuahuahauhauahuaha 5 1 49438093890484984949
sklçskçlskdkcnopeieo 5 1 93084098409840984098
2d5fe941380938098948 5 1 49809380398094894844
9883jkjd3eu0dj0j3930 5 1 636f50c9380938093893

How can I do this? Can I use a CASE expression with lower()? I'm new to PySpark.

  • If you invert those dictionaries (that is, `{2: "facebooknotification", 4: "lightbox",...}`), then it's trivial to use `apply` to lookup the new values. – Tim Roberts Feb 22 '22 at 19:51
  • Have a look here: https://stackoverflow.com/questions/42980704/pyspark-create-new-column-with-mapping-from-a-dict – ScootCork Feb 22 '22 at 20:00
  • @ScootCork Did not work :( – Caroline Leite Feb 22 '22 at 20:06
  • USE `from itertools import chain` `from pyspark.sql.functions import create_map, lit` `m_expr1 = create_map([lit(x) for x in chain(*sk_channel_types.items())])` `m_expr2 = create_map([lit(x) for x in chain(*sk_story_types.items())])` `df = df.withColumn('Type', m_expr1[lower(df['Type'])]).withColumn('StoryType', m_expr2[lower(df['StoryType'])])` – wwnde Feb 22 '22 at 22:04

1 Answer

Since the dictionaries are small, the efficient way is to turn them into DataFrames, broadcast them, and join them to the main dataset.

Hossein Torabi