1

I'm trying to figure out how to a query to generate an ARRAY given a sliding window over a character column with Postgre.

For example, if I have this:

   pid           
   <chr>         
 1 WP_096038768.1
 2 WP_013465871.1
 3 WP_058155244.1
 4 WP_011329269.1
 5 WP_058374608.1
 6 WP_089368983.1
 7 WP_096739105.1
 8 WP_089346667.1
 9 WP_096041177.1
10 WP_010553306.1
...

I want a sliding window of size 1 before and after the row. The result is this:

   pid            g                                           
   <chr>          <chr>                                       
 1 WP_013465871.1 WP_096038768.1,WP_013465871.1,WP_058155244.1
 2 WP_058155244.1 WP_013465871.1,WP_058155244.1,WP_011329269.1
 3 WP_011329269.1 WP_058155244.1,WP_011329269.1,WP_058374608.1
 4 WP_058374608.1 WP_011329269.1,WP_058374608.1,WP_089368983.1
 5 WP_089368983.1 WP_058374608.1,WP_089368983.1,WP_096739105.1
 6 WP_096739105.1 WP_089368983.1,WP_096739105.1,WP_089346667.1
 7 WP_089346667.1 WP_096739105.1,WP_089346667.1,WP_096041177.1
 8 WP_096041177.1 WP_089346667.1,WP_096041177.1,WP_010553306.1
 9 WP_010553306.1 WP_096041177.1,WP_010553306.1,WP_007376542.1
10 WP_007376542.1 WP_010553306.1,WP_007376542.1,WP_039038284.1
...

Any hint is appreciated.

This example I did with R:

library(tidyverse)
library(dbplyr)
library(RPostgreSQL) 
library(DBI)

st2tm %>% 
  mutate(
    p1 = lag(pid),
    p2 = lead(pid)
  ) %>% 
  group_by(pid) %>% 
  mutate(g = paste(na.omit(c(p1,pid,p2)), sep = ",")) %>% 
  ungroup() %>% 
  select(-c(p1, p2)) %>% 
  filter(str_count(g,",")==2)

But when applied to a Postgres table through a DBI connection it fails with

Error in vapply(x, escape, character(1), con = con) : 
  values must be length 1,
 but FUN(X[[1]]) result is length 3

at paste and Error: str_count() is not available in this SQL variant at the filter.

Also, I think some smarter strategy.

Vadim Kotov
  • 8,084
  • 8
  • 48
  • 62
Aureliano Guedes
  • 767
  • 6
  • 22

1 Answers1

1

This is most likely due to dbplyr not having translations defined for converting na.omit or str_count into postgresql (a translation for paste is most likely defined).

You can replace str_count and na.omit by checking earlier for missing values.

st2tm %>% 
  mutate(
    p1 = lag(pid),
    p2 = lead(pid)
  ) %>% 
  filter(!is.na(p1),
         !is.na(p2)) %>%
  mutate(g = paste(p1, ",", pid, ",", p2)) %>% 
  select(-c(p1, p2)) %>% 

And if paste is the issue you could replace it with postgresql's inbuilt CONCAT function.

st2tm %>% 
  mutate(
    p1 = lag(pid),
    p2 = lead(pid)
  ) %>% 
  filter(!is.na(p1),
         !is.na(p2)) %>%
  mutate(g = CONCAT(p1, ",", pid, ",", p2)) %>% 
  select(-c(p1, p2)) %>% 

Because CONCAT is not an R function, dbplyr will pass it as written to postgresql rather than attempting to translate it.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
  • thanks. `CONCAT` generates a string, and if I want an `ARRAY` instead/ There is a function to do it? – Aureliano Guedes Sep 29 '20 at 20:38
  • 1
    I am not aware of any dbplyr translations that play nicely with more advanced data types like arrays. You might be able to use the same idea as `CONCAT` and pass a non-translatable cast-to-array command to dbplyr which would pass it as written to postgresql. But you would probably then then find it hard to use this column with standard dplyr verbs and you would have to keep using the non-translation trick. At that point it would be worth working in postgresql directly. – Simon.S.A. Sep 29 '20 at 23:41