I have a dictionary, my_dict_of_df, that holds a variable number of DataFrames each time my program runs. I want to create a new DataFrame that is the union of all of them.

The DataFrames are stored under keys like this:

my_dict_of_df["df_1"], my_dict_of_df["df_2"] and so on...

How do I union all these dataframes?

kev
  • Possible duplicate of [Spark union of multiple RDDs](https://stackoverflow.com/questions/33743978/spark-union-of-multiple-rdds) – pault Mar 14 '19 at 18:20
  • @pault I've consulted that answer, but the return value is a list of dataframe objects and not a new unionized dataframe. I intend to do further operations on this newly created dataframe. – kev Mar 14 '19 at 21:08
  • The return value on the linked post and the code in my other comment is a DataFrame. It is not a list of DataFrames. – pault Mar 14 '19 at 21:19
  • It's because of the way `unionAll` is defined here to take in `*dfs`. Either call it by unpacking your values: `unionAll(*my_dic.values())` OR change the function definition to take a single (iterable) argument: `def unionAll(dfs): return reduce(DataFrame.unionAll, dfs)` – pault Mar 15 '19 at 14:14

1 Answer

I used the solution from the linked post, thanks to @pault.

from functools import reduce
from pyspark.sql import DataFrame

def union_all(*dfs):
    return reduce(DataFrame.union, dfs)

df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))

my_dic = {}
my_dic["df1"] = df1
my_dic["df2"] = df2
my_dic["df3"] = df3

new_df = union_all(*my_dic.values())

print(type(new_df))   # <class 'pyspark.sql.dataframe.DataFrame'>
new_df.show()  # show() prints the table itself and returns None

"""
+---+----+
|  k|   v|
+---+----+
|  1|foo1|
|  2|bar1|
|  3|foo2|
|  4|bar2|
|  5|foo3|
|  6|bar3|
+---+----+
"""

Edit: switched to DataFrame.union instead of DataFrame.unionAll, since the latter was deprecated in Spark 2.0.

kev
  • I'd suggest using DataFrame.unionByName if you are unsure your columns are in the same order – Kate Jul 02 '20 at 16:27
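
Following up on the two comments above, here is a minimal sketch that combines both suggestions: the helper takes a single iterable (so no unpacking with `*` is needed) and unions by column name rather than by position. The name `union_by_name` is made up for this example, and it assumes Spark 2.3 or later, where `DataFrame.unionByName` is available.

```python
from functools import reduce


def union_by_name(dfs):
    """Union an iterable of DataFrames, matching columns by name.

    `union_by_name` is a hypothetical helper name, not part of PySpark.
    """
    dfs = list(dfs)
    if not dfs:
        # reduce() on an empty sequence raises a less helpful TypeError
        raise ValueError("need at least one DataFrame to union")
    # unionByName resolves columns by name instead of by position
    return reduce(lambda left, right: left.unionByName(right), dfs)


# Usage with the dictionary from the answer:
# new_df = union_by_name(my_dic.values())
```

Like `union`, this keeps duplicate rows, so add `.distinct()` afterwards if you want set semantics. It should still fail if the DataFrames do not share the same set of column names.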