1

In pyspark, I'm trying to replace multiple text values in a column by the value that are present in the columns which names are present in the calc column (formula).

So to be clear, here is an example :

Input:

|param_1|param_2|calc 
|-------|-------|--------
|Cell 1 |Cell 2 |param_1-param_2
|Cell 3 |Cell 4 |param_2/param_1

Output needed:

|param_1|param_2|calc 
|-------|-------|--------
|Cell 1 |Cell 2 |Cell 1-Cell 2
|Cell 3 |Cell 4 |Cell 4/Cell 3

In the column calc, the default value is a formula. It can be something as much as simple as the ones provided above or it can be something like "2*(param_8-param_4)/param_2-(param_3/param_7)". What I'm looking for is something to substitute all the param_x by the values in the related columns regarding the names.

I've tried a lot of things but nothing works at all and most of the time when I use replace or regex_replace with a column for the replacement value, the error the column is not iterable occurs.

Moreover, the columns param_1, param_2, ..., param_x are generated dynamically and the calc column values can some of these columns but not necessary all of them.

Could you help me on the subject with a dynamic solution ?

Thank you so much. Best regards

Ronak Jain
  • 3,073
  • 1
  • 11
  • 17
Cazau
  • 13
  • 3
  • What about you take a tutorial on `concat`, `concat_ws`. Try using them to solve this problem and if you run into issues let us know what the error is? pyspark documentation can be found here https://spark.apache.org/docs/latest/api/python/reference/index.html – wwnde Jan 05 '23 at 21:27
  • What is the connection between my need to replace variables in a formula and the functions concat ? And I've just read all the documentation as you provided in case a new purpose of them were listed but no, they just concat data. – Cazau Jan 05 '23 at 23:18
  • So where is the formula? That wasnt clear, from what you provided, looks to me like calc is a concat of param1 and 2 – wwnde Jan 05 '23 at 23:23
  • In the column calc, the default value is the formula. It can be something as much as simple as the ones provided above or it can be something like "2*(param_8-param_4)/param_2-(param_3/param_7)". What I'm looking for is something to substitute all the param_x by the values in the related columns regarding the names. – Cazau Jan 06 '23 at 08:58

1 Answers1

1

Update: Turned out I misunderstood the requirement. This would work:

for exp in ["regexp_replace(calc, '"+col+"', "+col+")" for col in df.schema.names]:
   df=df.withColumn("calc", F.expr(exp))

Yet Another Update: To Handle Null Values add coalesce:

for exp in ["coalesce(regexp_replace(calc, '"+col+"', "+col+"), calc)" for col in df.schema.names]:
   df=df.withColumn("calc", F.expr(exp))

Input/Output: In/Out

------- Keeping the below section for a while just for reference -------

You can't directly do that - as you won't be able to use column value directly unless you collect in a python object (which is obviously not recommended).

This would work with the same:

    df = spark.createDataFrame([["1","2", "param_1 - param_2"],["3","4", "2*param_1 + param_2"]]).toDF("param_1", "param_2", "calc");

    df.show()

    df=df.withColumn("row_num", F.row_number().over(Window.orderBy(F.lit("dummy"))))

    as_dict = {row.asDict()["row_num"]:row.asDict()["calc"] for row in df.select("row_num", "calc").collect()}

    expression = f"""CASE {' '.join([f"WHEN row_num ='{k}' THEN ({v})" for k,v in as_dict.items()])} \
            ELSE NULL END""";

    df.withColumn("Result", F.expr(expression)).show();

Input/Output:

Output

Ronak Jain
  • 3,073
  • 1
  • 11
  • 17
  • Hi, thank you. It's great for the next step of other process. It works fine with numbers but the actual step here is substitute a label by other label, my example output is really what I need now. If you have a trick to do it :-) – Cazau Jan 06 '23 at 16:09
  • @Cazau Ah ok, got it. Realised just now, let me check I'll update my answer once done. – Ronak Jain Jan 06 '23 at 16:26
  • @Cazau Updated the answer accordingly, let me know if you need any help. – Ronak Jain Jan 06 '23 at 17:10
  • Thank you so much @Ronak Jain, both solutions will be used for our project. It's almost alright, just a little inconvenience, if for example column calc = param_1 and column param_2 is null then the calc column is null too. – Cazau Jan 09 '23 at 12:40
  • @Cazau That can be fixed using a simple colaese. let me know if you need help with the same. Also, consider upvoting and do accept the answer if it helps. – Ronak Jain Jan 09 '23 at 13:13
  • @Cazau Fixed Null issue for you, updated answer :) – Ronak Jain Jan 09 '23 at 13:23
  • 1
    Perfect, a big help, thanks @Ronak Jain ! – Cazau Jan 09 '23 at 14:55