Replace a substring of a string in pyspark dataframe

Question

How to replace substrings of a string. For example, I created a data frame based on the following json format.

line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}

I need to replace the substrings "1:", "2:", "3:", with "a:", "b:", "c:", and etc. So the result will be:

line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}

Please consider that this is just an example the real replacement is substring replacement not character replacement.

Any guidance either in Scala or Pyspark is helpful.

Oh, sorry I think my explanation is confusing. I only want to replace the numbers in the string after ":". Basically, P1, P2, ... Pn are keys and I don't want to replace the keys or change their names. I only want to replace the strings in the values ==> "1:" to "a:", "2:" to "b:" and so on. — Alan, Aug 22 '19 at 23:56
Like what? How is `27` translated? How is `32521` translated? — jwvh, Aug 23 '19 at 00:13
so the whole string before ":" is replaced with a new string. "1:" to "hello_word:", "2:" to "another_hello_word",... "27:" to "how_are_you:", "50:" to "how_am_I". Let's say you have a dictionary (map) that maps numbers to a string, the size of the map can change and it is not necessary 27... and I want to replace the number (as key in the dictionary) with it's value that can be one of those examples that I put. So it is not necessarily 27 numbers but it can be higher numbers, where it can change during time... I mean the map can change as well as its size. — Alan, Aug 23 '19 at 00:18

score 2 · Answer 1 · edited Aug 22 '19 at 22:31

2

from pyspark.sql.functions import *       
newDf = df.withColumn('col_name', regexp_replace('col_name', '1:', 'a:'))

Details here: Pyspark replace strings in Spark dataframe column

edited Aug 22 '19 at 22:31

Pushkr

3,591
18
31

answered Aug 22 '19 at 22:04

P. Phalak

457
1
4
11

Thanks for the guidance, I am looking for something that does all the replacements at once. For example based on a map (dictionary) replacing all the keys with the values ==> as "1:" to "a:", "2:" to "b:", and so on. – Alan Aug 22 '19 at 22:19
This solution is partial, the one in the link will work properly. The only problem is that it doesn't solve the question fully. As I my question show the json file after conversion to dataframe will have two columns as "I" and "F" where "F" datatype is a struct and when I try to use the solution in the shared link will have error due to datatype mismatch as it looks for string but column "F" datatype is not string. – Alan Aug 23 '19 at 17:44

jwvh · Answer 2 · 2019-08-23T06:43:48.650

Let's say you have a collection of strings for possible modification (simplified for this example).

val data = Seq("1:0.01"
              ,"3:0.03,4:0.04"
              ,"2:0.01,3:0.02"
              ,"5:0.02")

And you have a dictionary of required conversions.

val num2name = Map("1" -> "A"
                  ,"2" -> "Bo"
                  ,"3" -> "Cy"
                  ,"4" -> "Dee")

From here you can use replaceSomeIn() to make the substitutions.

data.map("(\\d+):".r  //note: Map key is only part of the match pattern
                  .replaceSomeIn(_, m => num2name.get(m group 1)  //get replacement
                                                 .map(_ + ":")))  //restore ":"
//res0: Seq[String] = List(A:0.01
//                        ,Cy:0.03,Dee:0.04
//                        ,Bo:0.01,Cy:0.02
//                        ,5:0.02)

As you can see, "5:" is a match for the regex pattern but since the 5 part is not defined in num2name, the string is left unchanged.

Thanks for your responding. Do you know how to do it in pyspark? what is the corresponding function for "replaceSomeIn" in pyspark? — Alan, Aug 23 '19 at 17:21

Alan · Accepted Answer · 2023-08-13T07:48:56.820

This is the way I solved it in PySpark:

def _name_replacement(val, ordered_mapping):
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

mapping = {"1:":"aaa:", "2:":"bbb:", ..., "24:":"xxx:", "25:":"yyy:", ....}
ordered_mapping = OrderedDict(reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1]))))
replacing = udf(lambda x: _name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))

Replace a substring of a string in pyspark dataframe

3 Answers3