1

How to replace substrings of a string. For example, I created a data frame based on the following json format.

line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}

I need to replace the substrings "1:", "2:", "3:", with "a:", "b:", "c:", and etc. So the result will be:

line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}

Please consider that this is just an example the real replacement is substring replacement not character replacement.

Any guidance either in Scala or Pyspark is helpful.

Alan
  • 417
  • 1
  • 7
  • 22
  • Oh, sorry I think my explanation is confusing. I only want to replace the numbers in the string after ":". Basically, P1, P2, ... Pn are keys and I don't want to replace the keys or change their names. I only want to replace the strings in the values ==> "1:" to "a:", "2:" to "b:" and so on. – Alan Aug 22 '19 at 23:56
  • `1:` to `a:`, and `2:` to `b:`, and `27:` to ...? – jwvh Aug 23 '19 at 00:08
  • yes, something like that. – Alan Aug 23 '19 at 00:10
  • Like what? How is `27` translated? How is `32521` translated? – jwvh Aug 23 '19 at 00:13
  • so the whole string before ":" is replaced with a new string. "1:" to "hello_word:", "2:" to "another_hello_word",... "27:" to "how_are_you:", "50:" to "how_am_I". Let's say you have a dictionary (map) that maps numbers to a string, the size of the map can change and it is not necessary 27... and I want to replace the number (as key in the dictionary) with it's value that can be one of those examples that I put. So it is not necessarily 27 numbers but it can be higher numbers, where it can change during time... I mean the map can change as well as its size. – Alan Aug 23 '19 at 00:18

3 Answers3

2
from pyspark.sql.functions import *       
newDf = df.withColumn('col_name', regexp_replace('col_name', '1:', 'a:'))

Details here: Pyspark replace strings in Spark dataframe column

Pushkr
  • 3,591
  • 18
  • 31
P. Phalak
  • 457
  • 1
  • 4
  • 11
  • Thanks for the guidance, I am looking for something that does all the replacements at once. For example based on a map (dictionary) replacing all the keys with the values ==> as "1:" to "a:", "2:" to "b:", and so on. – Alan Aug 22 '19 at 22:19
  • This solution is partial, the one in the link will work properly. The only problem is that it doesn't solve the question fully. As I my question show the json file after conversion to dataframe will have two columns as "I" and "F" where "F" datatype is a struct and when I try to use the solution in the shared link will have error due to datatype mismatch as it looks for string but column "F" datatype is not string. – Alan Aug 23 '19 at 17:44
1

Let's say you have a collection of strings for possible modification (simplified for this example).

val data = Seq("1:0.01"
              ,"3:0.03,4:0.04"
              ,"2:0.01,3:0.02"
              ,"5:0.02")

And you have a dictionary of required conversions.

val num2name = Map("1" -> "A"
                  ,"2" -> "Bo"
                  ,"3" -> "Cy"
                  ,"4" -> "Dee")

From here you can use replaceSomeIn() to make the substitutions.

data.map("(\\d+):".r  //note: Map key is only part of the match pattern
                  .replaceSomeIn(_, m => num2name.get(m group 1)  //get replacement
                                                 .map(_ + ":")))  //restore ":"
//res0: Seq[String] = List(A:0.01
//                        ,Cy:0.03,Dee:0.04
//                        ,Bo:0.01,Cy:0.02
//                        ,5:0.02)

As you can see, "5:" is a match for the regex pattern but since the 5 part is not defined in num2name, the string is left unchanged.

jwvh
  • 50,871
  • 7
  • 38
  • 64
  • Thanks for your responding. Do you know how to do it in pyspark? what is the corresponding function for "replaceSomeIn" in pyspark? – Alan Aug 23 '19 at 17:21
1

This is the way I solved it in PySpark:

def _name_replacement(val, ordered_mapping):
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

mapping = {"1:":"aaa:", "2:":"bbb:", ..., "24:":"xxx:", "25:":"yyy:", ....}
ordered_mapping = OrderedDict(reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1]))))
replacing = udf(lambda x: _name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))

   
Alan
  • 417
  • 1
  • 7
  • 22