
I've written the below code:

def func_sql(table, query):
    q = spark.sql(eval(f"f'{query}'")).collect()[0][0]
    print(q)

lst = []
for i in range(int_df.count()):
    lst.append(int_df.collect()[i][3])
# print(lst)

for x in lst:
    func_sql('table', '{}'.format(x))

It's giving the following output:

[screenshot of the printed output: one numeric result per query]

I want to add this output as a new column in a dataframe, but I'm not able to find a way to do that. I tried using a UDF, but that requires modifying an already existing column.

Any idea how I can do this?

  • do you want to create a new dataframe or add a column in an existing dataframe? how did you get the output -- is it a list? – samkart Jul 20 '22 at 08:14
  • I want to add a column in an existing dataframe. The output is the result of a for loop that I've mentioned in the code – Aishani Singh Jul 20 '22 at 08:23
  • a for loop resulting in what? single values or a whole list (like the `lst` mentioned in the Q)? – samkart Jul 20 '22 at 08:25
  • The output is the numerical values which I've mentioned below the code – Aishani Singh Jul 20 '22 at 08:31
  • spark won't let you add these values in a column. can you share details on target dataframe and how/where you get the values? there may be another way to achieve what you want – samkart Jul 20 '22 at 08:34
  • Okay so the target dataframe has three columns, 'Column','Rule' and 'Query'. The 'Query' column contains sql queries which I want to run on values of the 'Column' field. I've got the result after running the queries by defining a function that runs a query, getting all the queries as a list (lst in the code) and one by one running those queries (in the for loop). The output is the answer of the queries. I want to add a new column, 'Result' which has these values, to the dataframe. – Aishani Singh Jul 20 '22 at 08:44
  • okay, so what i get is -- `lst` has the list of queries that are run by the `func_sql()` function. this function doesn't return anything, so the values are unusable -- return them as a list of tuples where each tuple is (query, result). this list of tuples can then be converted to a spark dataframe. so, you'll recreate the query dataframe but with the additional field. – samkart Jul 20 '22 at 08:49

1 Answer

Based on your comments, here's what I got:

  • lst is the list of queries, where each element is one query
  • the func_sql() function runs a query and prints the result
  • you want the output in the query dataframe -- I'm guessing int_df is the query dataframe

We can recreate the whole query dataframe from the int_df rows. Since you're collecting the dataframe for the queries anyway, we'll collect it once and use all of the columns from the collected rows.

But first, we'll update func_sql() to return the query result instead of printing it.

def func_sql(table, query):
    # run the query and return the first cell of the result
    q = spark.sql(eval(f"f'{query}'")).collect()[0][0]
    return q

The part where you create lst is no longer required; we can iterate directly over the output of collect(). (A driver-side loop is the way to go here, since spark.sql() can't be called from inside a UDF; the SparkSession is only available on the driver.) Now, int_df (your dataframe having queries) has 3 columns in this order: 'Column', 'Rule' and 'Query'. We'll use this ordering when unpacking the rows.

res_list = []

# unpack each collected Row in column order and append the query's result
for column, rule, query in int_df.collect():
    res_list.append((column, rule, query, func_sql('table', '{}'.format(query))))

res_list now holds all the column values from int_df plus an additional value, the query's result. This list can now be used to create a spark dataframe.

res_sdf = spark.sparkContext.parallelize(res_list).toDF(['column', 'rule', 'query', 'result'])
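As a side note, spark.createDataFrame() accepts a list of tuples directly, so the parallelize step can be skipped if you prefer:

res_sdf = spark.createDataFrame(res_list, ['column', 'rule', 'query', 'result'])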

To show an example of how it works, I created a dummy dataframe and substituted a dummy result string for the func_sql() call.

data_ls = [
    ('col1', 'rule1', 'query1'),
    ('col2', 'rule2', 'query2')
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['column', 'rule', 'query'])

data_sdf.show()
# +------+-----+------+
# |column| rule| query|
# +------+-----+------+
# |  col1|rule1|query1|
# |  col2|rule2|query2|
# +------+-----+------+

res_list = []

for column, rule, query in data_sdf.collect():
    res_list.append((column, rule, query, '{0} result here'.format(query)))

print(res_list)
# [('col1', 'rule1', 'query1', 'query1 result here'),
#  ('col2', 'rule2', 'query2', 'query2 result here')]

res_sdf = spark.sparkContext.parallelize(res_list).toDF(['column', 'rule', 'query', 'result'])

res_sdf.show()
# +------+-----+------+------------------+
# |column| rule| query|            result|
# +------+-----+------+------------------+
# |  col1|rule1|query1|query1 result here|
# |  col2|rule2|query2|query2 result here|
# +------+-----+------+------------------+
samkart