
I am connected to Impala:

con <- DBI::dbConnect(odbc::odbc(), "impala connector", schema = "some_schema")
library(dplyr)
library(dbplyr) # I have to load both, otherwise tbl() won't work
table <- tbl(con, "serverTable")

I would like to use Pearson's r to track the change of a measure over time, as a quick and dirty prediction model.

Locally it works quite well, but I have problems implementing it on the server. Here's the code:

library(corrr)
table %>%
  filter(!is.na(VAR), VAR <= -10, VAR >= -32) %>%
  # VAR is the measure; values over -10 or under -32 are already out of the
  # threshold, and I want to intercept the subjects before that
  mutate(num_date = as.numeric(as.POSIXct(date))) %>%
  # convert the date string into the number of seconds since 1970
  group_by(id) %>%
  # the measure is taken daily for various subjects; I am interested in
  # isolating the subjects approaching the thresholds
  mutate(corr = corrr::correlate(VAR, num_date)) %>%
  ungroup() %>%
  # here I calculate Pearson's r; I must specify corrr:: or I get an error
  filter(abs(corr) > 0.9) %>%
  # locally I found that 0.9 works well for isolating the subjects whose
  # measure is approaching the thresholds
  select(id) %>%
  collect()

If I run this, though, I get the error:

Error in corrr::correlate(VAR, num_date) : object 'VAR' not found.

So I tried substituting that line with

mutate(corr = corrr::correlate(.$VAR, .$num_date)) %>%

and with this I get the error

Error in stats::cor(x = x, y = y, use = use, method = method) : supply both 'x' and 'y' or a matrix-like 'x'

If instead I try to use cor from stats, cor(VAR, num_date), I get the error

Error in new_result(connection@ptr, statement, immediate) : nanodbc/nanodbc.cpp:1412: HY000: [Cloudera][ImpalaODBC] (370) Query analysis error occurred during query execution: [HY000] : AnalysisException: some_schema.cor() unknown

It seems dbplyr can't translate cor into SQL (I can see this if I run show_query() instead of collect()).
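This is easy to check without touching the server. A minimal sketch (assuming dbplyr 2.x, which provides simulate_impala() for testing translations offline): dbplyr leaves functions it doesn't recognize untouched, so a literal cor(...) ends up in the generated SQL, and Impala then rejects it as an unknown function.

```r
library(dbplyr)

# Unknown functions are passed through verbatim, so this emits a literal
# cor(...) call in the generated SQL, which Impala cannot resolve:
translate_sql(cor(VAR, num_date), con = simulate_impala())
```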

EDIT: I solved the problem using SQL:

SELECT id, cor
FROM (
    SELECT id,
        ((tot_sum - (VAR_sum * date_sum / _count)) /
         sqrt((VAR_sq - pow(VAR_sum, 2.0) / _count) *
              (date_sq - pow(date_sum, 2.0) / _count))) AS cor
    FROM (
        SELECT id,
            sum(VAR) AS VAR_sum,
            sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sum,
            sum(VAR * VAR) AS VAR_sq,
            sum(CAST(CAST(date AS TIMESTAMP) AS DOUBLE) *
                CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS date_sq,
            sum(VAR * CAST(CAST(date AS TIMESTAMP) AS DOUBLE)) AS tot_sum,
            count(*) AS _count
        FROM some_schema.serverTable
        WHERE VAR IS NOT NULL
          AND VAR <= -10 AND VAR >= -32
        GROUP BY id) AS a) AS b
WHERE ABS(cor) > 0.9 AND ABS(cor) <= 1

Thanks to this article: https://chartio.com/learn/postgresql/correlation-coefficient-pearson/

goingdeep
2 Answers


cor is not in the list of functions that dbplyr can translate — see here: https://dbplyr.tidyverse.org/articles/sql-translation.html#known-functions

You can try the following in your code:

mutate(corr = translate_sql(corr(VAR, num_date)))

This should translate directly to CORR(VAR, num_date). These translations don't work in all database types. If you can't get this working in your case, you likely have no choice but to collect() your data before running non-translatable functions.
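Since dbplyr passes unrecognized functions through to the database verbatim, an equivalent (hypothetical) way to write this, without translate_sql(), is to call the database's aggregate directly inside summarise(). This only works if the backend actually implements CORR(); per the comment below, Impala does not, which is why the same "corr() unknown" error comes back.

```r
# Hypothetical sketch: relies on dbplyr's verbatim pass-through of unknown
# functions. CORR() exists in e.g. PostgreSQL and Oracle, but not in Impala.
table %>%
  group_by(id) %>%
  summarise(corr = CORR(VAR, num_date)) %>%
  filter(abs(corr) > 0.9) %>%
  select(id) %>%
  collect()
```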

Keith McNulty
  • thanks for the answer and for the link, unfortunately it doesn't work either, I still get the same error (some_schema.corr() unknown) – goingdeep Apr 20 '21 at 11:14

My solution was to use dplyr's functions to replicate the correlation formula:

temp_cor <- d_price_w_db %>% # your table from tbl(con, "NAME OF TABLE")
  group_by(GroupA, GroupB) %>% # your groups
  # then use summarise to create the correlation;
  # you can create as many as you like:
  summarise(
    cor_temp_ab = (avg(temp_a * temp_b) - (avg(temp_a) * avg(temp_b))) /
                  (sd(temp_a) * sd(temp_b)),
    .groups = "drop"
  )

This creates the SQL query that computes your correlation coefficients; you can see it with show_query(temp_cor). Finally, just do

local_object <- temp_cor %>%
  collect()

to save the result of the query in a local object.

The formula for correlation is from this post: https://www.red-gate.com/simple-talk/blogs/statistics-sql-pearsons-correlation/
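A quick local sanity check of that formula (hypothetical data, plain base R): mean(x*y) - mean(x)*mean(y) is the population covariance, while sd() is the sample standard deviation, so the result differs from stats::cor by a factor of (n-1)/n — negligible for daily data collected over months, but worth knowing.

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

# Population covariance over sample standard deviations:
manual <- (mean(x * y) - mean(x) * mean(y)) / (sd(x) * sd(y))

# Rescaling by n/(n-1) recovers stats::cor exactly:
all.equal(manual * 100 / 99, cor(x, y))
```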

aavila