0

I'm using RSQlite to import Datasets from an SQlite-Database. There are multiple millions of observations within the Database. Therefor I'd like to do as much as possible of Data selection and aggregation within the Database.

At some point I need to aggregate a character variable. I want to get the value which occures the most ordered by a group. How can I edit the following dplyr-chain so it works also with RSQlite?

library(tidyverse)
library(RSQLite)

# Path to Database
DATABASE="./xxx.db"

# Connect Database
mydb <- dbConnect(RSQLite::SQLite(), DATABASE)


# Load Database
data = tbl(mydb, "BigData") 

# Query Database
Summary <- data %>% 
  filter(year==2020) %>%
  group_by(Grouping_variable) %>% 
  summarize(count=n(),
            Item_variable=names(which.max(table(Item_variable))))

Within R that code would do it's job. Querying the database I get an error code Error: near "(": syntax error

Original pipe contains more filters and steps.

Example Database would basically look like:

data.frame(Grouping_variable=c("A","A","B","C","C","C","D","D","D","D"),
           year=c(2019,2020,2019,2020,2020,2020,2020,2020,2020,2021),
           Item_variable=c("X","Y","Y","X","X","Y","Y","Y","X","X"))

   Grouping_variable year Item_Variable
1                  A 2019             X
2                  A 2020             Y
3                  B 2019             Y
4                  C 2020             X
5                  C 2020             X
6                  C 2020             Y
7                  D 2020             Y
8                  D 2020             Y
9                  D 2020             X
10                 D 2021             X

Result should look like:

 Grouping_variable count Item_variable
  <chr>             <int> <chr>        
1 A                     1 Y            
2 C                     3 X            
3 D                     3 Y      
thuettel
  • 165
  • 1
  • 11

1 Answers1

1

Assuming that DF is the data frame defined in the question and using SQL we calculate the count of each item within group in the year 2020 giving tmp and then take the row whose count is maximum giving tmp2 - SQLite guarantees that when using group by and max that the other fields come from the row where the maximum was found. Also take the sum of the counts in tmp2 and finally select just the desired columns.

library(sqldf)

sql <- "with tmp as (
  select Grouping_variable, count(*) count, Item_variable from DF
    where year = 2020
    group by Grouping_variable, Item_variable
  ),
  tmp2 as (
    select Grouping_variable, max(count), sum(count) count, Item_variable 
    from tmp
    group by Grouping_variable
  )
  select Grouping_variable, count, Item_variable
  from tmp2
"
sqldf(sql)

giving:

  Grouping_variable count Item_variable
1                 A     1             Y
2                 C     3             X
3                 D     3             Y

Added

Suppose that DF were a table in your database. This code creates such a database.

library(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "database.sqlite")
dbWriteTable(con, 'DF', DF, row.names = FALSE)
dbDisconnect(con)

then this would run the sql command in the sql string defined above on that database and return the result.

library(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "database.sqlite")
result <- dbGetQuery(con, sql)
dbDisconnect(con)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thanks! But you're approach is only using SQL, right? I would like to stay with mainly R, since that is what I'm familiar with. Is there a way to only substitute the command in question with an sql() commad? `Item_Variable=sql()` or at least the summarize part? There are multiple filters and mutates ahead and following the part in question. – thuettel Dec 07 '21 at 16:17
  • The answer shown is R code. sqldf itself uses RSQLite but you can use it directly if DF is already a table in an sqlite database. See the new Added section which creates such a database for testing purposes and then runs the SQL statement against. it. – G. Grothendieck Dec 07 '21 at 16:55
  • My problem is that with your approach I need to translate all comands before the summarize to SQL to be able to do the querry. The example above is simplyfied, but in reality there are multiple commands (filter, select, slice etc.) ahead of the summarize. That's why I would like to only subsitute the command in question and integrate it into the R-Pipleline. Like: Summary <- data %>% filter(year==2020) %>% group_by(Grouping_variable) %>% summarize(count=n(), Item_variable=sql(SQL_STATEMENT_HERE)) I thought that might be possible within dplyr. – thuettel Dec 07 '21 at 17:05
  • sql is easier to learn and there are more resources to learn it than dplyr so if you know dplyr you can certainly learn sql. – G. Grothendieck Dec 07 '21 at 17:31