1

I have a table with 3 columns, and a few thousand records, sample is below:

df1 <- data.frame(
  ID = c('V1', 'V1', 'V1', 'V3', 'V3', 'V3', 'V4', 'V5','V5','V5'),
  Category = c('a', 'a', 'a', 'a', 'b', 'b', 'a', 'b', 'c', 'c'),
  Amount   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))

Need to query with sqldf to group the data by ID and Category, then sum the amount, and only return rows when the ID has more than one category. The result would look like below:

ID  Category Amount_Sum
V3  a        1
V3  b        2
V5  b        1
V5  c        2

I tried the following code which doesnt really work, and also another row_number over partition which sort of works but is way longer than it needs to be needing many additional queries.

df2 <- sqldf::sqldf("
  SELECT   Count(*) [CNT]
           [ID], 
           [Category], 
           SUM([Amount]) [amount]
  FROM     df1
  GROUP BY [ID], 
           [Category]
  Having   Count(*) > 1")

What is the best way to get that output with sqldf in R? Thanks!

Jorge
  • 336
  • 1
  • 4
  • 15
  • 1
    Do you need to use sqldf? With `dplyr` it would be `df1 %>% group_by(ID, Category) %>% summarize(CNT = n(), amount = sum(Amount)) %>% filter(CNT > 1)` – Gregor Thomas Apr 12 '22 at 16:11
  • @GregorThomas Thank you, that is very helpful to use your dplyr code. However, we want to use SQL in case we need to move code outside of R. Also, your profile is hilarious. – Jorge Apr 12 '22 at 16:39

1 Answers1

2

You could add a join which restricts to only IDs having more than one category:

SELECT t1.ID, t1.Category, t1.Amount_Sum
FROM
(
    SELECT ID, Category, SUM(Amount) AS Amount_Sum
    FROM df1
    GROUP BY ID, Category
) t1
INNER JOIN
(
    SELECT ID
    FROM df1
    GROUP BY ID
    HAVING COUNT(DISTINCT Category) > 1
) t2
    ON t2.ID = t1.ID;
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360