A good approach to solving your problem is to use the Spark plugin.
To add it to your Data Fusion instance, go to HUB -> Plugins -> search for Spark -> Deploy the plugin. You can then find it in the Analytics tab.
To give you an example of how you could use it, I created the pipeline below:

This pipeline basically:
- Reads a file from GCS.
- Applies a rank function to your data.
- Filters the data into rank = 1 and rank > 1 in separate branches.
- Saves the data to different locations.
Now let's take a deeper look at each component:
1 - GCS: this is a simple GCS source. The file used for this example has the data shown below.

2 - Spark_rank: this is a Spark plugin with the code below. The code basically creates a temporary view with your data and then applies a query to rank your rows. After that, your data goes back to the pipeline. Below you can also see the input and output data for this step. Notice that the output is duplicated because it is delivered to two branches.
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
  // Register the incoming records as a temporary view so they can be queried with SQL
  df.createTempView("source")
  // Rank each account's rows by record date, newest first (rank 1 = most recent)
  df.sparkSession.sql("SELECT AccountNumber, Address, Record_date, RANK() OVER (PARTITION BY AccountNumber ORDER BY Record_date DESC) AS rank FROM source")
}
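If you want to try the rank logic outside Data Fusion first, the minimal sketch below reproduces it in a plain spark-shell session. The sample rows are hypothetical, made up only to show what RANK() produces for the (AccountNumber, Address, Record_date) schema:

// Hypothetical sample rows, runnable in spark-shell, to test the rank step locally.
val df = Seq(
  ("A-1", "10 Main St", "2020-03-15"),
  ("A-1", "22 Oak Ave", "2020-01-01"),
  ("A-2", "5 Pine Rd",  "2020-02-10")
).toDF("AccountNumber", "Address", "Record_date")

df.createOrReplaceTempView("source")
val ranked = spark.sql("SELECT AccountNumber, Address, Record_date, RANK() OVER (PARTITION BY AccountNumber ORDER BY Record_date DESC) AS rank FROM source")
ranked.show()
// A-1 gets rank 1 for its 2020-03-15 row and rank 2 for the older one;
// A-2 has a single row, so it gets rank 1.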

3 - Spark2 and Spark3: like the step above, these steps use the Spark plugin to transform the data. Spark2 keeps only the rows with rank = 1, using the code below:
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
  // The view name must be unique within the pipeline, hence "source_0"
  df.createTempView("source_0")
  // Keep only the most recent record per account and drop the rank column
  df.sparkSession.sql("SELECT AccountNumber, Address, Record_date FROM source_0 WHERE rank = 1")
}
Spark3 gets the data with rank > 1 using the code below:
def transform(df: DataFrame, context: SparkExecutionPluginContext) : DataFrame = {
  df.createTempView("source_1")
  // Keep the older records (everything that is not the most recent per account)
  df.sparkSession.sql("SELECT AccountNumber, Address, Record_date FROM source_1 WHERE rank > 1")
}
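Outside Data Fusion, you could express the same two branches with the DataFrame API instead of SQL. This is just a sketch continuing from the `ranked` value in the spark-shell example above, not the plugin code itself:

import org.apache.spark.sql.functions.col

// Equivalent of the Spark2 branch: the most recent record per account
val latest = ranked.filter(col("rank") === 1).drop("rank")
// Equivalent of the Spark3 branch: the older records
val history = ranked.filter(col("rank") > 1).drop("rank")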
4 - GCS2 and GCS3: finally, in this step your data is saved to GCS again.
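In the pipeline the sinks are configured in the GCS sink plugin's UI; the plain-Spark equivalent would look like the sketch below, assuming the GCS connector is on the classpath (as it is on Dataproc). The bucket and paths are hypothetical placeholders:

// Hypothetical output locations; replace with your own bucket and paths.
latest.write.option("header", "true").csv("gs://your-bucket/accounts/latest/")
history.write.option("header", "true").csv("gs://your-bucket/accounts/history/")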