I'm currently working in Databricks and have a delta table with 20+ columns. I basically need to take a value from one column in each row, send it to an API which returns two values/columns, and then recreate the other 26 columns so I can merge the values back into the original delta table. So the input is 28 columns and the output is 28 columns. Currently my code looks like:

from pyspark.sql.types import *
from pyspark.sql import functions as F
import requests, uuid, json
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.functions import col,lit
from functools import reduce

spark.conf.set("spark.sql.adaptive.enabled","true")
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
spark.sql('set spark.sql.execution.arrow.pyspark.enabled = true')
spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true")
spark.conf.set("spark.sql.parquet.compression.codec","gzip")
spark.conf.set("spark.sql.inMemorycolumnarStorage.compressed","true")
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning","true");

output=spark.sql("select * from delta.`table`").cache()

SeriesAppend=[]

for i in output.collect():
    #small mapping fix
    if i['col1']=='val1':
      var0='a'
    elif i['col1']=='val2':
      var0='b'
    elif i['col1']=='val3':
      var0='c'
    elif i['col1']=='val4':
      var0='d'

    var0=set([var0])
    req_var = set(['a','b','c','d'])
    var_list=list(req_var-var0)

    #subscription info

    headers = {header}

    body = [{
      'text': i['col2']
    }]
    
    if len(i['col2'])<500:
      request = requests.post(constructed_url, params=params, headers=headers, json=body)
      response = request.json()
      dumps=json.dumps(response[0])
      loads = json.loads(dumps)
      json_rdd = sc.parallelize(loads)
      json_df = spark.read.json(json_rdd)
      json_df = json_df.withColumn('col1',lit(i['col1']))
      json_df = json_df.withColumn('col2',lit(i['col2']))
      json_df = json_df.withColumn('col3',lit(i['col3']))
      ...
      SeriesAppend.append(json_df)
    
    else:
      pass

Series_output=reduce(DataFrame.unionAll, SeriesAppend)

SAMPLE DF with only 3 columns:

df = spark.createDataFrame(
    [
        ("a", "cat","owner1"),  # create your data here, be consistent in the types.
        ("b", "dog","owner2"),
        ("c", "fish","owner3"),
        ("d", "fox","owner4"),
        ("e", "rat","owner5"),
    ],
    ["col1", "col2", "col3"])  # add your column names here

I really just need to write the response + other column values to a delta table, so DataFrames are not necessarily required, but I haven't found a faster way than the above. Right now, I can run 5 inputs, which returns 15 in 25.3 seconds without the unionAll. With the inclusion of the union, it turns into 3 minutes.

The final output would look like:

df = spark.createDataFrame(
    [
        ("a", "cat","owner1","MI", 48003),  # create your data here, be consistent in the types.
        ("b", "dog","owner2", "MI", 48003),
        ("c", "fish","owner3","MI", 48003),
        ("d", "fox","owner4","MI", 48003),
        ("e", "rat","owner5","MI", 48003),
    ],
    ["col1", "col2", "col3", "col4", "col5"])  # add your column names here

How can I make this faster in spark?

dcrowley01
  • `collect` then `for` would run the entire thing on the driver, which is **not** scalable and **not** parallel. Have you considered writing a UDF instead? – pltc Oct 19 '21 at 01:59
  • So I know I can write a UDF, but I'm not sure what that would look like for the above or what I can do in that UDF to make it run faster. Can you demo an example? – dcrowley01 Oct 19 '21 at 03:01
  • I can try to make a demo if you explain what you want to do, what the sample input is, and what the expected output is. And I can assure you, using a UDF (vs. your code above) is absolutely scalable. It's probably not faster with a small dataset, but try with a big one and you will see the difference. – pltc Oct 19 '21 at 03:40
  • So the response is just two additional columns. For the sample df I provided, where one row would be col1, col2, col3, the output would be col1, col2, col3, col4, col5. We can label those columns with state and zipcode filler. I've added an output sample above for clarity. – dcrowley01 Oct 19 '21 at 18:04

1 Answer

As mentioned in my comments, you should use a UDF to distribute the workload to the workers instead of collecting everything and letting a single machine (the driver) run it all. That is simply the wrong approach and it is not scalable.

import requests

from pyspark.sql import functions as F
from pyspark.sql import types as T

# This is your main function, pure Python and you can unit test it in any way you want.
# The most important thing about this function is:
# - everything must be encapsulated inside the function, no global variables work here
def req(col1, col2):
    if col1 == 'val1':
        var0 = 'a'
    elif col1 == 'val2':
        var0 = 'b'
    elif col1 == 'val3':
        var0 = 'c'
    elif col1 == 'val4':
        var0 = 'd'
    
    var0=set([var0])
    req_var = set(['a','b','c','d'])
    var_list = list(req_var - var0)
    
    #subscription info

    headers = {header} # !!! `header` must be available **inside** this function, a global won't work

    body = [{
      'text': col2
    }]
    
    if len(col2) < 500:
        # !!! same as `header`: `constructed_url` and `params` must be available **inside** this function, globals won't work
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        # Pull the two new values out of the response.
        # Adjust the index/keys here to match your API's actual response shape.
        return (response[0]['col4'], response[0]['col5'])
    else:
        return None

# Now you wrap the function above into a Spark UDF.
# I'm using only 2 columns here as input, but you can use as many columns as you wish.
# Same with the output: I'm using a tuple with only 2 elements, but you can make it as many items as you wish.
df = df.withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2'))
df.show()

# Output
# +----+----+------+------------------+
# |col1|col2|  col3|              temp|
# +----+----+------+------------------+
# |   a| cat|owner1|[foo_cat, bar_cat]|
# |   b| dog|owner2|[foo_dog, bar_dog]|
# |   c|fish|owner3|              null|
# |   d| fox|owner4|              null|
# |   e| rat|owner5|              null|
# +----+----+------+------------------+

# Now all you have to do is extract the tuple and assign to separate columns
# (and delete temp column to cleanup)
(df
    .withColumn('col4', F.col('temp')[0])
    .withColumn('col5', F.col('temp')[1])
    .drop('temp')
    .show()
)

# Output
# +----+----+------+-------+-------+
# |col1|col2|  col3|   col4|   col5|
# +----+----+------+-------+-------+
# |   a| cat|owner1|foo_cat|bar_cat|
# |   b| dog|owner2|foo_dog|bar_dog|
# |   c|fish|owner3|   null|   null|
# |   d| fox|owner4|   null|   null|
# |   e| rat|owner5|   null|   null|
# +----+----+------+-------+-------+
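
And since the end goal is just to write the response plus the other column values back to a delta table, here is a minimal sketch of how you could merge the enriched rows back into the original table. The table path and the join key (`col1`) below are placeholders for illustration; point them at your real delta table and its key column.

from delta.tables import DeltaTable

# Materialize the two new columns (same as above, but assigned to a variable).
result = (df
    .withColumn('col4', F.col('temp')[0])
    .withColumn('col5', F.col('temp')[1])
    .drop('temp')
)

# '/path/to/delta/table' and the join condition on col1 are placeholders.
target = DeltaTable.forPath(spark, '/path/to/delta/table')
(target.alias('t')
    .merge(result.alias('s'), 't.col1 = s.col1')
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

If you just want to append the enriched rows instead of merging, a plain result.write.format('delta').mode('append').save('/path/to/delta/table') works as well.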

pltc