I'm one month into the data world, and my goal is to refactor existing local R scripts to work with SparkR on Databricks.
This is the R code:
library(data.table)

# intLoadFiles is our internal helper for loading the monthly extract
minmaxAcctDates <- intLoadFiles("Accounts_BalanceEOD", monthID)
minmaxAcctDates$CUSTOMER_NUMBER <- as.integer(minmaxAcctDates$CUSTOMER_NUMBER)

# Sort by balance, then keep the first row per customer:
# ascending order gives the minimum-balance row, descending the maximum
AccountsFilteredMin <- unique(setDT(minmaxAcctDates)[order(BOOK_BALANCE_LCY)], by = "CUSTOMER_NUMBER")
AccountsFilteredMax <- unique(setDT(minmaxAcctDates)[order(BOOK_BALANCE_LCY, decreasing = TRUE)], by = "CUSTOMER_NUMBER")

# The CALENDAR_DATE of that row is the date of the min/max balance
AccountsFilteredMin$MIN_AC_BAL_DATE <- AccountsFilteredMin$CALENDAR_DATE
AccountsFilteredMax$MAX_AC_BAL_DATE <- AccountsFilteredMax$CALENDAR_DATE
AccountsFilteredMin <- AccountsFilteredMin[, c("CUSTOMER_NUMBER", "MIN_AC_BAL_DATE")]
AccountsFilteredMax <- AccountsFilteredMax[, c("CUSTOMER_NUMBER", "MAX_AC_BAL_DATE")]
This is what I tried in SparkR to reproduce it:
library(SparkR)

minmaxAcctDates <- read.df("abfss://x@adlsstorage.dfs.core.windows.net/x/Accounts_BalanceEOD.csv",
                           source = "csv", header = "true", inferSchema = "true")

# Min side: per-customer minimum balance, then join back to get the date
AccountsFilteredMin <- agg(groupBy(minmaxAcctDates, minmaxAcctDates$CUSTOMER_NUMBER),
                           min(minmaxAcctDates$BOOK_BALANCE_LCY))
AccountsFilteredMin <- join(AccountsFilteredMin, minmaxAcctDates)   # no joinExpr
AccountsFilteredMin <- dropDuplicates(AccountsFilteredMin, "CUSTOMER_NUMBER")

# Max side: same pattern with max()
AccountsFilteredMax <- agg(groupBy(minmaxAcctDates, minmaxAcctDates$CUSTOMER_NUMBER),
                           max(minmaxAcctDates$BOOK_BALANCE_LCY))
AccountsFilteredMax <- join(AccountsFilteredMax, minmaxAcctDates)
AccountsFilteredMax <- dropDuplicates(AccountsFilteredMax, "CUSTOMER_NUMBER")

# Final selection fails too: MIN_AC_BAL_DATE / MAX_AC_BAL_DATE never get
# created, since I couldn't work out the rename from CALENDAR_DATE
AccountsFilteredMin <- select(AccountsFilteredMin, "CUSTOMER_NUMBER", "MIN_AC_BAL_DATE")
AccountsFilteredMax <- select(AccountsFilteredMax, "CUSTOMER_NUMBER", "MAX_AC_BAL_DATE")

# (I also tried distinct() on the source table and collect() at various
# points while debugging, without success)
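From reading the join() docs, I think the root cause is calling join() with no joinExpr: it seems to end up as a cross join and keeps both copies of every column. I sketched the variant below for the min side, with renamed aggregate columns and an explicit join condition, but I haven't verified that it fixes the problems I list further down. It's untested, and the names minAgg, joined, CUST_NO and MIN_BAL are just mine:

# Sketch, untested: rename the aggregate's columns so nothing in the join
# condition or output clashes with the raw table
minAgg <- agg(groupBy(minmaxAcctDates, "CUSTOMER_NUMBER"),
              alias(min(minmaxAcctDates$BOOK_BALANCE_LCY), "MIN_BAL"))
minAgg <- withColumnRenamed(minAgg, "CUSTOMER_NUMBER", "CUST_NO")

# Explicit join condition: match each customer's rows at the min balance
joined <- join(minAgg, minmaxAcctDates,
               minAgg$CUST_NO == minmaxAcctDates$CUSTOMER_NUMBER &
                 minAgg$MIN_BAL == minmaxAcctDates$BOOK_BALANCE_LCY,
               "inner")

# Keep one copy of the key and rename CALENDAR_DATE in the same select
AccountsFilteredMin <- select(joined,
                              alias(joined$CUST_NO, "CUSTOMER_NUMBER"),
                              alias(joined$CALENDAR_DATE, "MIN_AC_BAL_DATE"))

# A customer can hit the same minimum on several dates; keep one row each
AccountsFilteredMin <- dropDuplicates(AccountsFilteredMin, "CUSTOMER_NUMBER")

The max side would be the same with max() and MAX_AC_BAL_DATE, as far as I can tell.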
From trial and error with my attempt above, the issues I am facing are:
- Duplicate columns after the join
- "Ambiguous column" errors for CUSTOMER_NUMBER when I then try dropDuplicates or select
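I also came across window functions in the SparkR docs, which look like they could avoid the join-back entirely: rank each customer's rows by balance and keep the first. Again this is only a sketch from my reading of the docs, not verified, and the names wMin, wMax, ranked, rn_min and rn_max are mine:

# Sketch, untested: one ascending and one descending ranking per customer
wMin <- orderBy(windowPartitionBy("CUSTOMER_NUMBER"), "BOOK_BALANCE_LCY")
wMax <- orderBy(windowPartitionBy("CUSTOMER_NUMBER"),
                desc(minmaxAcctDates$BOOK_BALANCE_LCY))

ranked <- withColumn(minmaxAcctDates, "rn_min", over(row_number(), wMin))
ranked <- withColumn(ranked, "rn_max", over(row_number(), wMax))

# The row ranked 1 ascending carries the min-balance date; descending, the max
AccountsFilteredMin <- select(filter(ranked, ranked$rn_min == 1),
                              ranked$CUSTOMER_NUMBER,
                              alias(ranked$CALENDAR_DATE, "MIN_AC_BAL_DATE"))
AccountsFilteredMax <- select(filter(ranked, ranked$rn_max == 1),
                              ranked$CUSTOMER_NUMBER,
                              alias(ranked$CALENDAR_DATE, "MAX_AC_BAL_DATE"))

If I understand row_number() correctly, it keeps exactly one row per customer even when several dates tie on the same balance, which seems to match what unique(..., by =) does in the data.table version.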
Grateful if I could receive some help/guidance on whether either of these sketches is on the right track, thanks.