
I am working with sparklyr and am having trouble changing column classes and using dplyr to aggregate the data. This is my current code:

.libPaths(c(.libPaths(), '/usr/lib/spark/R/lib'))
Sys.setenv(SPARK_HOME = "/usr/lib/spark")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

library(sparklyr)
library(dplyr)
library(magrittr)

sc <- sparkR.session(master = "xxxxx")
df <- read.df("path", "csv", header = "true", inferSchema = "true", na.strings = "NA")

df1<-select(df, df$DATE, df$Subject, df$Source, df$Cost, df$Test)

       DATE      Subject               Source Cost     Test
1 11/8/2016 07gjAAAAAAAq    AAAA_MOAAAGRAAAAA    2        2
2 11/8/2016 07gjAAAAAAAq      BBBB_MOBBB4BBB2    7        7
3 11/8/2016 07gjAAAAAAAq BBBB_MOBICCCCCCCCC14    2        2
4 11/8/2016 07gjAAAAAAAq SCCT_MOBIDDDDDDDDD14    1        1
5 11/8/2016 07gjAAAAAAAq    REET_MOBBBBBBBB01    2        1
6 11/8/2016 07gjAAAAAAAq      SCCT_MRRRF4RR22   11       11

Two questions based on this:

1) How do I change the DATE column to a date class? The way I did it in the past was:

df1$DATE<-as.Date(df1$DATE,'%m/%d/%Y')

This was the error:

Error in as.Date.default(df1$DATE, "%m/%d/%Y") : 
  do not know how to convert 'df1$DATE' to class “Date”

Any help would be great, thanks!

nak5120
  • Within `dplyr` functions, you would just refer to the bare column name and not use `$`. For example, your `group_by` statement should be `group_by(DATE, Subject)` – Jake Kaupp Feb 01 '17 at 16:25
  • I was actually able to figure this part out. It's different in sparklyr. The way you do it is: variable1<-summarize(groupBy(df1, df1$DATE, df1$`Subject`), Revenue = sum(df1$Cost), Test = mean(df1$Test)) – nak5120 Feb 01 '17 at 16:36
  • Not according to [this](http://spark.rstudio.com/dplyr.html), the `dplyr` syntax remains the same. – Jake Kaupp Feb 01 '17 at 16:41
  • Do you know how to convert a column to date class? – nak5120 Feb 01 '17 at 16:46
  • I don't think it is supported just yet but you can use sql for [date comparisons](https://github.com/rstudio/sparklyr/issues/202) – LyzandeR Feb 01 '17 at 17:20
  • Thanks @LyzandeR I'll take a look into this. – nak5120 Feb 01 '17 at 17:22
  • Can dapply be used? https://spark.apache.org/docs/2.1.0/api/R/dapply.html – nak5120 Feb 01 '17 at 18:49
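
On the date conversion itself: the code in the question calls SparkR functions (`sparkR.session`, `read.df`), so `df1$DATE` is a Spark `Column` rather than an R vector, which is likely why `as.Date` cannot handle it. One possible approach, offered as a sketch rather than a confirmed fix, is to do the conversion with SparkR's own column functions `unix_timestamp`, `cast`, and `to_date`. The local master and the two-row stand-in for `df1` below are assumptions made only so the example is self-contained, and the format string is a Java-style pattern (`M/d/yyyy`), not the `%m/%d/%Y` form that `as.Date` uses.

```r
library(SparkR)

# Assumption: a local session for illustration; the question uses its own master URL.
sparkR.session(master = "local[*]")

# Hypothetical stand-in for df1, with DATE stored as a string like in the question.
df1 <- createDataFrame(data.frame(
  DATE = c("11/8/2016", "1/15/2017"),
  Cost = c(2, 7),
  stringsAsFactors = FALSE
))

# Parse the string into seconds since the epoch using a Java-style pattern,
# cast that number to a timestamp, then truncate the timestamp to a date.
df1$DATE <- to_date(cast(unix_timestamp(df1$DATE, "M/d/yyyy"), "timestamp"))

# Inspect the schema to check the resulting type of DATE.
printSchema(df1)
```

Because the conversion stays inside Spark, nothing is pulled back to the R process. `dapply` with an explicit schema might also work, but it ships each partition through R, so the column-function route is probably the lighter option.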
