
I have a CSV source file with this schema defined.

["Name", "Address", "TaxId", "SS Number", "Mobile Number", "Gender", "LastVisited"]

From this CSV, these are the operations I need to do:

  1. Select a subset of the columns, one subset at a time, and map each subset to this fixed schema:

    ["Name", "Address", "Mobile", "UniqueID", "UniqueIdentifierRefCode"]

So, for example, in the first iteration I will select only this subset of the columns:

[Col("Name"), Col("Address"), Col("Mobile Number"), Col("TaxId"), Lit("TaxIdentifier")]

  2. In the next iteration I need to select a different subset, but map it to the same fixed schema:

    [Col("Name"), Col("Address"), Col("Mobile Number"), Col("SS Number"), Lit("SocialSecurityNumber")]

I can do all of this by running a for loop, selecting out the columns, and doing a UnionAll at the end. But is there a better way to let Spark handle this?
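To make the expected output concrete, here is a plain-Python sketch of the loop-and-union result I mean (not Spark code; the sample row is made up):

```python
# Plain-Python illustration of the loop + union described above (not Spark code).
# Each (columns, literal) pair maps one source subset onto the fixed schema.
FIXED_SCHEMA = ["Name", "Address", "Mobile", "UniqueID", "UniqueIdentifierRefCode"]

mappings = [
    (["Name", "Address", "Mobile Number", "TaxId"], "TaxIdentifier"),
    (["Name", "Address", "Mobile Number", "SS Number"], "SocialSecurityNumber"),
]

source_rows = [  # hypothetical sample row
    {"Name": "A", "Address": "X", "TaxId": "T1", "SS Number": "S1",
     "Mobile Number": "123", "Gender": "F", "LastVisited": "2021-01-01"},
]

output = []
for cols, literal in mappings:      # one pass per subset, then "union all"
    for row in source_rows:
        values = [row[c] for c in cols] + [literal]
        output.append(dict(zip(FIXED_SCHEMA, values)))

# Each source row yields one output row per mapping, all on the fixed schema.
```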

Abhay Sibal
  • Is it that for some rows you have either a TaxId or an SS Number? If the row has a TaxId, you want to use that, and if the row has an SS Number, you want to use that? – Ed Elliott Feb 26 '21 at 08:54
  • I want to derive both. For each row in the original CSV, I want to construct 2 rows in the output. The first row will have TaxId, and the next row will have SS Number. Both rows have a common schema and should show up in the output. But I don't want to write loops, since the number of rows I am deriving can go up to 3 or 4 in the future. – Abhay Sibal Feb 26 '21 at 09:08

1 Answer


You don't need a loop; you can use a union for this. If you filter the DataFrame down to the rows you want, you can use Union. In the filters below I used IsNotNull(), but you can use whatever filter you like (if you aren't sure of the filter syntax, give us more details and I'll help).

// Rows that have a TaxId: copy TaxId into the shared UniqueId column.
var taxId = dataFrame.Filter(Functions.Col("TaxId").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("TaxId"));

// Rows that have an SS Number: copy it into the same UniqueId column.
var ssId = dataFrame.Filter(Functions.Col("SS Number").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("SS Number"));

// Union resolves columns by position; both DataFrames have the same schema here.
var unionedDataFrame = taxId.Union(ssId);
unionedDataFrame.Show();

Once you have your final data frame you can either select the columns you actually want or drop the columns you don't want:

unionedDataFrame.Drop("TaxId").Show();

or

unionedDataFrame.Select("Name", "UniqueId").Show();

Logically within Spark this is exactly the same as:

dataFrame.Filter(Functions.Col("TaxId").IsNotNull())
    .WithColumn("UniqueId", Functions.Col("TaxId"))
    .Union(
        dataFrame.Filter(Functions.Col("SS Number").IsNotNull())
            .WithColumn("UniqueId", Functions.Col("SS Number"))
    ).Show();
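If it helps to see the shape of the result, here is a quick pure-Python sketch of what the filter + new column + union produces (not Spark code; the two sample rows are made up):

```python
# Pure-Python sketch of the filter + WithColumn + Union pattern above.
rows = [  # hypothetical sample rows
    {"Name": "A", "TaxId": "T1", "SS Number": None},
    {"Name": "B", "TaxId": None, "SS Number": "S2"},
]

# Filter to non-null TaxId, then add a UniqueId column from TaxId.
tax_id = [{**r, "UniqueId": r["TaxId"]} for r in rows if r["TaxId"] is not None]

# Filter to non-null SS Number, then add a UniqueId column from SS Number.
ss_id = [{**r, "UniqueId": r["SS Number"]} for r in rows if r["SS Number"] is not None]

# Union: rows with a TaxId, followed by rows with an SS Number.
unioned = tax_id + ss_id
```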

Also note that every method call returns a new DataFrame, so the result of dataFrame.Filter() is a separate DataFrame from dataFrame. The important thing, though, is that because of lazy evaluation, Spark only builds the plan when it actually executes the query.

Ed Elliott
  • I know this can be done, but can it be done without creating 2 DataFrames? I want to combine both these operations into one. – Abhay Sibal Feb 26 '21 at 11:02
  • Spark uses lazy evaluation so as long as you don’t do something like .Show() then it will be one operation – have a look at https://data-flair.training/blogs/apache-spark-lazy-evaluation/ – Ed Elliott Feb 26 '21 at 12:04