I am connected to Impala:
con <- DBI::dbConnect(odbc::odbc(), "impala connector", schema = "some_schema")
library(dplyr)
library(dbplyr) # I have to load both of them, otherwise tbl() won't work
table <- tbl(con, 'serverTable')
I would like to use Pearson's R to track the change of a measure over time, as a quick-and-dirty prediction model.
Locally it works quite well, but I am having problems implementing it on the server. Here's the code:
library(corrr)
table %>%
  filter(!is.na(VAR), VAR < -10, VAR > -32) %>%
  # VAR is the measure; values over -10 or under -32 are already out of the
  # threshold, and I want to intercept the subjects before they get there
  mutate(num_date = as.numeric(as.POSIXct(date))) %>%
  # convert the date string into the number of seconds since 1970
  group_by(id) %>%
  # the measure is taken daily for various subjects; I am interested in
  # isolating the subjects approaching the thresholds
  mutate(corr = corrr::correlate(VAR, num_date)) %>%
  # here I calculate Pearson's R; I must specify corrr:: or I get an error
  ungroup() %>%
  filter(abs(corr) > 0.9) %>%
  # working locally, I found that 0.9 is a good cutoff for isolating the
  # subjects whose measure is approaching the thresholds
  select(id) %>%
  collect()
If I run this, though, I get the error:
Error in corrr::correlate(VAR, num_date) : object 'VAR' not found.
So I tried to replace that line with
mutate(corr = corrr::correlate(.$VAR, .$num_date)) %>%
and with this I get the error
Error in stats::cor(x = x, y = y, use = use, method = method) : supply both 'x' and 'y' or a matrix-like 'x'
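For reference, the computation I'm after does work once the data is local; a minimal sketch of the local version (collecting first, so every function runs in plain R, and assuming the filtered data fits in memory):

local_data <- table %>%
  filter(!is.na(VAR), VAR < -10, VAR > -32) %>%
  select(id, VAR, date) %>%
  collect() %>%
  # as.POSIXct() now runs in R, not on the server
  mutate(num_date = as.numeric(as.POSIXct(date)))

local_data %>%
  group_by(id) %>%
  summarise(corr = stats::cor(VAR, num_date)) %>%
  filter(abs(corr) > 0.9)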
If instead I try to use cor from stats, cor(VAR, num_date), I get the error
Error in new_result(connection@ptr, statement, immediate) : nanodbc/nanodbc.cpp:1412: HY000: [Cloudera][ImpalaODBC] (370) Query analysis error occurred during query execution: [HY000] : AnalysisException: some_schema.cor() unknown
as if dbplyr can't translate cor() into SQL (I can see this if I run show_query() instead of collect()).
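Indeed, dbplyr seems to pass functions it has no translation for through to the backend verbatim, which would explain why cor() reaches Impala unchanged and Impala complains about an unknown function. A quick check against dbplyr's simulated Impala connection (no server needed; the printed SQL below is approximate):

library(dbplyr)
# cor() has no Impala translation, so it is emitted as-is
translate_sql(cor(VAR, num_date), con = simulate_impala())
#> <SQL> cor(`VAR`, `num_date`)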
EDIT: I solved the problem using SQL:
SELECT id, cor
FROM (
    SELECT id,
           ((tot_sum - (VAR_sum * date_sum / _count)) /
            sqrt((VAR_sq - pow(VAR_sum, 2.0) / _count) *
                 (date_sq - pow(date_sum, 2.0) / _count))) AS cor
    FROM (
        SELECT id,
               sum(VAR) AS VAR_sum,
               sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sum,
               sum(VAR * VAR) AS VAR_sq,
               sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE) * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sq,
               sum(VAR * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS tot_sum,
               count(*) AS _count
        FROM (
            SELECT id, VAR, date
            FROM some_schema.serverTable
            WHERE VAR IS NOT NULL
              AND VAR < -10
              AND VAR > -32) AS a
        GROUP BY id) AS b) AS c
WHERE ABS(cor) > 0.9 AND ABS(cor) <= 1
Thanks to this article: https://chartio.com/learn/postgresql/correlation-coefficient-pearson/
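For completeness, the same manual computation can probably be kept in dplyr, since dbplyr does translate sum(), n(), sqrt() and basic arithmetic. A sketch, untested against an actual Impala connection, using sql() to inline the timestamp cast because I'm not sure as.POSIXct() translates:

table %>%
  filter(!is.na(VAR), VAR < -10, VAR > -32) %>%
  # inline the cast rather than relying on a dbplyr translation
  mutate(num_date = sql("CAST(CAST(date AS TIMESTAMP) AS DOUBLE)")) %>%
  group_by(id) %>%
  summarise(
    VAR_sum  = sum(VAR, na.rm = TRUE),
    date_sum = sum(num_date, na.rm = TRUE),
    VAR_sq   = sum(VAR * VAR, na.rm = TRUE),
    date_sq  = sum(num_date * num_date, na.rm = TRUE),
    tot_sum  = sum(VAR * num_date, na.rm = TRUE),
    n        = n()
  ) %>%
  # same Pearson formula as in the SQL above
  mutate(
    corr = (tot_sum - VAR_sum * date_sum / n) /
      sqrt((VAR_sq - VAR_sum * VAR_sum / n) *
           (date_sq - date_sum * date_sum / n))
  ) %>%
  filter(abs(corr) > 0.9, abs(corr) <= 1) %>%
  select(id) %>%
  collect()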