Aggregate is deleting rows of the data

Question

I'm new with R and I have tried a lot to solve this problem, if anyone could help me I'd be very grateful! This is my problem:

I have to work with timeseries of a product that are separated by year, type (import or export from the country) and the size of the products in kilograms, something like this:

dat<-data.frame(NAME=c("P1","P1","P2","P2","P1","P2","P1","P1") , YEAR =c(1991,1991,1991,1991,1992,1992,1993,1993), TYPE=c("IMPORT","EXPORT","IMPORT","EXPORT","IMPORT","EXPORT","IMPORT","EXPORT"), VALUE=c(300,200,170,150,150,120,90,100))
dat
#  NAME YEAR   TYPE VALUE
#1   P1 1991 IMPORT   300
#2   P1 1991 EXPORT   200
#3   P2 1991 IMPORT   170
#4   P2 1991 EXPORT   150
#5   P1 1992 IMPORT   150
#6   P2 1992 EXPORT   120
#7   P1 1993 IMPORT    90
#8   P1 1993 EXPORT   100

So, what I have to do is to get the difference between the importations and exportations of the product for all the years and products in the data. It should look like this

solution<-data.frame(NAME=c("P1","P2","P1","P2","P1"),Year=c(1991,1991,1992,1992,1993),VALUE=c(100,20,150,-120,-10))
solution

#  NAME Year VALUE
#1   P1 1991   100
#2   P2 1991    20
#3   P1 1992   150
#4   P2 1992  -120
#5   P1 1993   -10

I used aggregate to solve it but when I do it, the code deletes the product p1 and p2 in 1992 because there's no exportations for P1 or importations for p2 in that year. Does anyone know how to solve it?

This is part of my code:

agg<-sort(data, f= ~ year + name)
agg<-aggregate(size~year + name, data=data, FUN=diff)

Matias Andina · Accepted Answer · 2019-08-13T14:02:07.860

3

This is the tidyverse way of doing it

library(tidyr)
library(dplyr)
spread(dat, TYPE, VALUE) %>%
  mutate(IMPORT = ifelse(is.na(IMPORT), 0, IMPORT),
         EXPORT = ifelse(is.na(EXPORT), 0, EXPORT),
         diff = IMPORT - EXPORT)

which produces

  NAME YEAR EXPORT IMPORT diff
1   P1 1991    200    300  100
2   P1 1992      0    150  150
3   P1 1993    100     90  -10
4   P2 1991    150    170   20
5   P2 1992    120      0 -120

Update

I totally forgot about the fill argument in spread, thanks to the comment, the code can be reduced to

spread(dat, TYPE, VALUE, fill=0) %>%
  mutate(diff = IMPORT - EXPORT)

edited Aug 13 '19 at 14:02

answered Aug 12 '19 at 21:10

Matias Andina

4,029
4
26
58

1

There is `fill` argument in `spread`. If you use that then the code will be reduced to `spread(dat, TYPE, VALUE, fill = 0) %>% mutate(diff = IMPORT - EXPORT) ` – Ronak Shah Aug 13 '19 at 04:23

Lisardo Erman · Answer 2 · 2019-08-13T15:43:15.650

First I reshape the data to split the column "TYPE" into one export and one import column. Look up the basic concepts of reshaping (here we reshape from long to wide).

new_dat <- reshape(dat, idvar = c("NAME","YEAR"), timevar = "TYPE", direction = "wide")

Second, I want to substract the export column from the import column. Normally this would just be: new_dat$ex_im <- new_dat$VALUE.IMPORT - new_dat$VALUE:EXPORT. But because we want to change the NA values for zeros, I replace the export and import column with zero if it is missing and let it untouched if not: ifelse(is.na(new_dat$VALUE.IMPORT),0,new_dat$VALUE.IMPORT)

new_dat$ex_im <- ifelse(is.na(new_dat$VALUE.IMPORT),0,new_dat$VALUE.IMPORT) - ifelse(is.na(new_dat$VALUE.EXPORT),0,new_dat$VALUE.EXPORT)

new_dat
#  NAME YEAR VALUE.IMPORT VALUE.EXPORT ex_im
#1   P1 1991          300          200   100
#3   P2 1991          170          150    20
#5   P1 1992          150           NA   150
#6   P2 1992           NA          120  -120
#7   P1 1993           90          100   -10

Could you explain some of what this code does, and why it works? — camille, Aug 12 '19 at 21:04

score 1 · Answer 3 · answered Aug 13 '19 at 04:16

One option without reshaping is to use tidyr::complete to add missing observations in TYPE column.

library(dplyr)

dat %>%
  group_by(NAME, YEAR) %>%
  tidyr::complete(TYPE, fill = list(VALUE = 0)) %>%
  summarise(VALUE = sum(VALUE[TYPE == "IMPORT"]) - sum(VALUE[TYPE == "EXPORT"]))
  #If you would always have one entry for "IMPORT" and "EXPORT" in each group use
  #summarise(VALUE = VALUE[TYPE == "IMPORT"] - VALUE[TYPE == "EXPORT"])

#  NAME   YEAR VALUE
#  <fct> <dbl> <dbl>
#1 P1     1991   100
#2 P1     1992   150
#3 P1     1993   -10
#4 P2     1991    20
#5 P2     1992  -120

score 0 · Answer 4 · answered Aug 12 '19 at 20:17

d = Reduce(function(x, y) merge(x, y, by = c("NAME", "YEAR"), all = TRUE),
       split(dat, dat$TYPE))
d$VALUE = replace(d$VALUE.y, is.na(d$VALUE.y), 0) - replace(d$VALUE.x, is.na(d$VALUE.x), 0)
d
#  NAME YEAR TYPE.x VALUE.x TYPE.y VALUE.y VALUE
#1   P1 1991 EXPORT     200 IMPORT     300   100
#2   P1 1992   <NA>      NA IMPORT     150   150
#3   P1 1993 EXPORT     100 IMPORT      90   -10
#4   P2 1991 EXPORT     150 IMPORT     170    20
#5   P2 1992 EXPORT     120   <NA>      NA  -120

Could you explain some of what this code does, and why it works? — camille, Aug 12 '19 at 21:04

score 0 · Answer 5 · answered Aug 12 '19 at 20:24

You can do this by converting your data to wide format and subtracting the columns. Here I use dcast from data.table to convert to wide.

Note: ifelse(is.na(IMPORT), 0, IMPORT) can be simplified to coalesce(IMPORT, 0), but the coalesce function is still in the dev version of data.table so I did not include it. (if you want, dplyr has a coalesce function, or you could download the dev version of data.table from github)

library(data.table)
setDT(dat)
dat_wide <- dcast(dat, ... ~ TYPE, value.var = 'VALUE')
dat_wide[, diff := ifelse(is.na(IMPORT), 0, IMPORT) - ifelse(is.na(EXPORT), 0 , EXPORT)]


dat_wide
#    NAME YEAR EXPORT IMPORT diff
# 1:   P1 1991    200    300  100
# 2:   P1 1992     NA    150   NA
# 3:   P1 1993    100     90  -10
# 4:   P2 1991    150    170   20
# 5:   P2 1992    120     NA   NA

Aggregate is deleting rows of the data

5 Answers5

Update