1

Still new to the world of Azure Databricks, the use of SparkR remains very obscure to me, even for very simple tasks...

It took me a very long time to find how to count distinct values, and I'm not sure it's the right way to go :

library(SparkR)
sparkR.session()

DW <- sql("select * from db.mytable")
nb.var <- head(summarize(DW, n_distinct(DW$VAR)))

I thought I found, but nb.per is not an object, but still a dataframe...

class(nb.per)
[1] "data.frame"

I tried :

nb.per <- as.numeric(head(summarize(DW, n_distinct(DW$PERIODE))))

It seems ok, but I'm pretty sure there is a better way to achieve this ?

Thanks !

Discus23
  • 471
  • 2
  • 11

2 Answers2

0

The SparkR::sql function returns a SparkDataFrame.

In order to use this in R as an R data.frame, you can simply coerce it:

 as.data.frame(sql("select * from db.mytable"))
Robert Long
  • 5,722
  • 5
  • 29
  • 50
  • On very (very) large table, won't I lose in performance ? I wanted to work with SparkR, which seems the most efficient for working on Hive tables ? – Discus23 Jan 12 '22 at 13:09
  • Yes, but the question literally asks how to store the result into an R object, and I think I've answered that. – Robert Long Jan 12 '22 at 13:10
  • You are right. Maybe I am lost because I have to rewrite my R data.table script on Databricks. Are there any SparkR objects to store a numeric value ? – Discus23 Jan 12 '22 at 13:21
  • I would suggest asking a new question about that. Perhaps include a portion of the R dataframe code and show any attempts that you've made yourself. You might also want to look into the `sparklyr` package. – Robert Long Jan 12 '22 at 13:48
0

Since you are anyway using Spark SQL, a very simple approach would be to do like this:
nb.per <- `[[`(SparkR::collect(SparkR::sql("select count(distinct VAR) from db.mytable")), 1).
And using SparkR APIs like:

DW <- SparkR::tableToDF("db.mytable")
nb.per <- `[[`(SparkR::collect(SparkR::agg(DW, SparkR::countDistinct(SparkR::column("VAR")))), 1)
Vivek Atal
  • 468
  • 5
  • 11
  • Seems great to me ! But what does ````'[['```` mean for, please ?? – Discus23 Jan 14 '22 at 13:10
  • 1
    It is the usual subset operator as provided in `base`, just like we subset an element from a list or vector: # x <- list(a = c(1, 2), b = c("e")) # in this case, x$a, x[["a"]], and \`[[\`(x, "a") are equivalent. Please note the enclosing backticks. You may find more details by doing " ?\`[[\` " – Vivek Atal Jan 15 '22 at 08:15
  • Oooh ok thank you, I did not know this syntax. – Discus23 Jan 17 '22 at 09:23