
I have a CSV source file with this schema defined.

["Name", "Address", "TaxId", "SS Number", "Mobile Number", "Gender", "LastVisited"]

From this CSV, these are the operations I need to do:

  1. Select a subset of the columns, one subset at a time, and map each subset to this fixed schema:

    ["Name", "Address", "Mobile", "UniqueID", "UniqueIdentifierRefCode"]

So, for example, in the first iteration I will select only this subset of the columns:

[Col("Name"), Col("Address"), Col("Mobile Number"), Col("TaxId"), Lit("TaxIdentifier")]

  2. In the next iteration I need to select a different subset, but map it to the same fixed schema:

    [Col("Name"), Col("Address"), Col("Mobile Number"), Col("SS Number"), Lit("SocialSecurityNumber")]

I can do all of this by running a for loop, selecting out the columns, and doing a UnionAll at the end. But is there a better way to let Spark handle this?
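To make the expected output concrete, here is a plain-Python sketch of the loop-and-union result I mean (not Spark code; the sample row is made up):

```python
# Plain-Python illustration of the loop + union described above (not Spark code).
# Each (columns, literal) pair maps one source subset onto the fixed schema.
FIXED_SCHEMA = ["Name", "Address", "Mobile", "UniqueID", "UniqueIdentifierRefCode"]

mappings = [
    (["Name", "Address", "Mobile Number", "TaxId"], "TaxIdentifier"),
    (["Name", "Address", "Mobile Number", "SS Number"], "SocialSecurityNumber"),
]

source_rows = [  # hypothetical sample row
    {"Name": "A", "Address": "X", "TaxId": "T1", "SS Number": "S1",
     "Mobile Number": "123", "Gender": "F", "LastVisited": "2021-01-01"},
]

output = []
for cols, literal in mappings:      # one pass per subset, then "union all"
    for row in source_rows:
        values = [row[c] for c in cols] + [literal]
        output.append(dict(zip(FIXED_SCHEMA, values)))

# Each source row yields one output row per mapping, all on the fixed schema.
```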

Abhay Sibal
  • Is it that for some rows you have either a TaxId or an SS Number? If the row has a TaxId, you want to use that, and if the row has an SS Number, you want to use that? – Ed Elliott Feb 26 '21 at 08:54
  • I want to derive both. For each row in the original CSV, I want to construct 2 rows in the output. The first row will have TaxId, and the next row will have SS Number. Both rows have a common schema and should show up in the output. But I don't want to write loops, since the number of rows I am deriving can go up to 3 or 4 in the future. – Abhay Sibal Feb 26 '21 at 09:08

1 Answer


You don't need a loop; you can use a union for this. If you filter the DataFrame down to the rows you want, you can use Union. In the filters below I used IsNotNull(), but you can use whatever filter you like (if you aren't sure of the filter syntax, give us more details and I'll help).

// Rows that have a TaxId: copy TaxId into the shared UniqueId column.
var taxId = dataFrame.Filter(Functions.Col("TaxId").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("TaxId"));

// Rows that have an SS Number: copy it into the same UniqueId column.
var ssId = dataFrame.Filter(Functions.Col("SS Number").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("SS Number"));

// Union resolves columns by position; both DataFrames have the same schema here.
var unionedDataFrame = taxId.Union(ssId);
unionedDataFrame.Show();

Once you have your final data frame you can either select the columns you actually want or drop the columns you don't want:

unionedDataFrame.Drop("TaxId").Show();

or

unionedDataFrame.Select("Name", "UniqueId").Show();

Logically within Spark this is exactly the same as:

dataFrame.Filter(Functions.Col("TaxId").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("TaxId"))
    .Union(
        dataFrame.Filter(Functions.Col("SS Number").IsNotNull())
            .WithColumn("UniqueId", Functions.Col("SS Number"))
    ).Show();
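If it helps to see the shape of the result, here is a quick pure-Python sketch of what the filter + new column + union produces (not Spark code; the two sample rows are made up):

```python
# Pure-Python sketch of the filter + WithColumn + Union pattern above.
rows = [  # hypothetical sample rows
    {"Name": "A", "TaxId": "T1", "SS Number": None},
    {"Name": "B", "TaxId": None, "SS Number": "S2"},
]

# Filter to non-null TaxId, then add a UniqueId column from TaxId.
tax_id = [{**r, "UniqueId": r["TaxId"]} for r in rows if r["TaxId"] is not None]

# Filter to non-null SS Number, then add a UniqueId column from SS Number.
ss_id = [{**r, "UniqueId": r["SS Number"]} for r in rows if r["SS Number"] is not None]

# Union: rows with a TaxId, followed by rows with an SS Number.
unioned = tax_id + ss_id
```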

Also note that every method call returns a new DataFrame, so the result of dataFrame.Filter() is a separate DataFrame from dataFrame. The important thing, though, is that because of lazy evaluation, Spark only builds the plan when it actually executes the query.

Ed Elliott
  • I know this can be done, but can it be done without creating 2 DataFrames? I want to combine both these operations into one. – Abhay Sibal Feb 26 '21 at 11:02
  • Spark uses lazy evaluation so as long as you don’t do something like .Show() then it will be one operation – have a look at https://data-flair.training/blogs/apache-spark-lazy-evaluation/ – Ed Elliott Feb 26 '21 at 12:04