I'm one month into the data world, and my goal is to refactor existing local R scripts to work with SparkR on Databricks.
This is the R code:
library(data.table)

# intLoadFiles is our internal helper for loading the monthly extract
minmaxAcctDates <- intLoadFiles("Accounts_BalanceEOD", monthID)
minmaxAcctDates$CUSTOMER_NUMBER <- as.integer(minmaxAcctDates$CUSTOMER_NUMBER)

# Sort by balance, then keep the first row per customer:
# ascending order gives the minimum-balance row, descending the maximum
AccountsFilteredMin <- unique(setDT(minmaxAcctDates)[order(BOOK_BALANCE_LCY)], by = "CUSTOMER_NUMBER")
AccountsFilteredMax <- unique(setDT(minmaxAcctDates)[order(BOOK_BALANCE_LCY, decreasing = TRUE)], by = "CUSTOMER_NUMBER")

# The CALENDAR_DATE of that row is the date of the min/max balance
AccountsFilteredMin$MIN_AC_BAL_DATE <- AccountsFilteredMin$CALENDAR_DATE
AccountsFilteredMax$MAX_AC_BAL_DATE <- AccountsFilteredMax$CALENDAR_DATE
AccountsFilteredMin <- AccountsFilteredMin[, c("CUSTOMER_NUMBER", "MIN_AC_BAL_DATE")]
AccountsFilteredMax <- AccountsFilteredMax[, c("CUSTOMER_NUMBER", "MAX_AC_BAL_DATE")]
This is what I tried in SparkR to reproduce it:
library(SparkR)

minmaxAcctDates <- read.df("abfss://x@adlsstorage.dfs.core.windows.net/x/Accounts_BalanceEOD.csv",
                           source = "csv", header = "true", inferSchema = "true")

# Min side: per-customer minimum balance, then join back to get the date
AccountsFilteredMin <- agg(groupBy(minmaxAcctDates, minmaxAcctDates$CUSTOMER_NUMBER),
                           min(minmaxAcctDates$BOOK_BALANCE_LCY))
AccountsFilteredMin <- join(AccountsFilteredMin, minmaxAcctDates)   # no joinExpr
AccountsFilteredMin <- dropDuplicates(AccountsFilteredMin, "CUSTOMER_NUMBER")

# Max side: same pattern with max()
AccountsFilteredMax <- agg(groupBy(minmaxAcctDates, minmaxAcctDates$CUSTOMER_NUMBER),
                           max(minmaxAcctDates$BOOK_BALANCE_LCY))
AccountsFilteredMax <- join(AccountsFilteredMax, minmaxAcctDates)
AccountsFilteredMax <- dropDuplicates(AccountsFilteredMax, "CUSTOMER_NUMBER")

# Final selection fails too: MIN_AC_BAL_DATE / MAX_AC_BAL_DATE never get
# created, since I couldn't work out the rename from CALENDAR_DATE
AccountsFilteredMin <- select(AccountsFilteredMin, "CUSTOMER_NUMBER", "MIN_AC_BAL_DATE")
AccountsFilteredMax <- select(AccountsFilteredMax, "CUSTOMER_NUMBER", "MAX_AC_BAL_DATE")

# (I also tried distinct() on the source table and collect() at various
# points while debugging, without success)
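From reading the join() docs, I think the root cause is calling join() with no joinExpr: it seems to end up as a cross join and keeps both copies of every column. I sketched the variant below for the min side, with renamed aggregate columns and an explicit join condition, but I haven't verified that it fixes the problems I list further down. It's untested, and the names minAgg, joined, CUST_NO and MIN_BAL are just mine:

# Sketch, untested: rename the aggregate's columns so nothing in the join
# condition or output clashes with the raw table
minAgg <- agg(groupBy(minmaxAcctDates, "CUSTOMER_NUMBER"),
              alias(min(minmaxAcctDates$BOOK_BALANCE_LCY), "MIN_BAL"))
minAgg <- withColumnRenamed(minAgg, "CUSTOMER_NUMBER", "CUST_NO")

# Explicit join condition: match each customer's rows at the min balance
joined <- join(minAgg, minmaxAcctDates,
               minAgg$CUST_NO == minmaxAcctDates$CUSTOMER_NUMBER &
                 minAgg$MIN_BAL == minmaxAcctDates$BOOK_BALANCE_LCY,
               "inner")

# Keep one copy of the key and rename CALENDAR_DATE in the same select
AccountsFilteredMin <- select(joined,
                              alias(joined$CUST_NO, "CUSTOMER_NUMBER"),
                              alias(joined$CALENDAR_DATE, "MIN_AC_BAL_DATE"))

# A customer can hit the same minimum on several dates; keep one row each
AccountsFilteredMin <- dropDuplicates(AccountsFilteredMin, "CUSTOMER_NUMBER")

The max side would be the same with max() and MAX_AC_BAL_DATE, as far as I can tell.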
From trial and error with my attempt above, the issues I am facing are:
- Duplicate columns after the join
- "Ambiguous column" errors for CUSTOMER_NUMBER when I then try dropDuplicates or select
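I also came across window functions in the SparkR docs, which look like they could avoid the join-back entirely: rank each customer's rows by balance and keep the first. Again this is only a sketch from my reading of the docs, not verified, and the names wMin, wMax, ranked, rn_min and rn_max are mine:

# Sketch, untested: one ascending and one descending ranking per customer
wMin <- orderBy(windowPartitionBy("CUSTOMER_NUMBER"), "BOOK_BALANCE_LCY")
wMax <- orderBy(windowPartitionBy("CUSTOMER_NUMBER"),
                desc(minmaxAcctDates$BOOK_BALANCE_LCY))

ranked <- withColumn(minmaxAcctDates, "rn_min", over(row_number(), wMin))
ranked <- withColumn(ranked, "rn_max", over(row_number(), wMax))

# The row ranked 1 ascending carries the min-balance date; descending, the max
AccountsFilteredMin <- select(filter(ranked, ranked$rn_min == 1),
                              ranked$CUSTOMER_NUMBER,
                              alias(ranked$CALENDAR_DATE, "MIN_AC_BAL_DATE"))
AccountsFilteredMax <- select(filter(ranked, ranked$rn_max == 1),
                              ranked$CUSTOMER_NUMBER,
                              alias(ranked$CALENDAR_DATE, "MAX_AC_BAL_DATE"))

If I understand row_number() correctly, it keeps exactly one row per customer even when several dates tie on the same balance, which seems to match what unique(..., by =) does in the data.table version.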
Grateful if I could receive some help/guidance on whether either of these sketches is on the right track, thanks.