6

I have a dataframe with crime data and associated "prices", organized by country and year (although I don't think this is important here). Here is a subset of my data:

> crime
# A tibble: 8 x 8
  iso    year  theft robbery burglary theft_price robbery_price burglary_price
  <chr> <dbl>  <dbl>   <dbl>    <dbl>       <dbl>         <dbl>          <dbl>
1 ALB    2003   3694     199      874        32.9          115           49.3
2 ALB    2004   3694     199      874        38.2          134           57.3
3 ALB    2005   3694     199      874        42.8          150           64.2
4 ALB    2006   3450     164      779        47.0          165           70.5
5 AUS    2003 722334   14634   586266       408.4         1427          612.4 
6 AUS    2004 636717   14634   512551       481.3         1683          721.2 
7 AUS    2005 598700   14634   468558       536.7         1877          804.5 
8 AUS    2006 594111   14634   433974       564.8         1973          846.5 

I want to create new columns that contain the product of each crime type with its price, so theft x theft_price = theft_prod, etc. In my actual dataset I have more crime types so I need something that is scalable to more variables than this subset contains.

I like the syntax of the dplyr package, so I to use something like this, but I cannot find the solution. I think it is not possible to reference other variables than the ones in vars(). Correct?

crime %>%
  mutate_at(vars(theft, robbery, burglary),
            funs(prod = . * ????))

Thanks.

avs
  • 617
  • 5
  • 13

2 Answers2

5

Use dplyr and tidyr:

library(dplyr); library(tidyr);

df %>% 
    gather(crime, value, -iso, -year) %>% 
    separate(crime, c('crime', 'type'), sep='_', fill = 'right') %>% 
    replace_na(list(type = 'amount')) %>% 
    spread(type, value) %>% 
    transmute(
        iso = iso, year = year, 
        crime = paste(crime, 'prod', sep = '_'), 
        prod = amount * price
    ) %>% 
    spread(crime, prod)

#  iso year burglary_prod robbery_prod  theft_prod
#1 ALB 2003       43088.2        22885    121532.6
#2 ALB 2004       50080.2        26666    141110.8
#3 ALB 2005       56110.8        29850    158103.2
#4 ALB 2006       54919.5        27060    162150.0
#5 AUS 2003   359029298.4     20882718 295001205.6
#6 AUS 2004   369651781.2     24629022 306451892.1
#7 AUS 2005   376954911.0     27468018 321322290.0
#8 AUS 2006   367358991.0     28872882 335553892.8

Another option without data reshaping, assuming the columns' names follow the crime_price convention:

library(tidyverse)
# find out the crimes columns
crimes = grep('^(?!.*_price$)', names(df)[-c(1,2)], perl = T, value = T)
# construct the crimes prices columns
crimes_prices = paste(crimes, 'price', sep = '_')
crimes_prod = paste(crimes, 'prod', sep = '_')

# loop through crime and crime price columns and multiply them
map2(crimes, crimes_prices, ~ df[[.x]] * df[[.y]]) %>% 
    set_names(crimes_prod) %>% 
    as_tibble() %>% 
    bind_cols(select(df, iso, year))

# A tibble: 8 x 5
#  theft_prod robbery_prod burglary_prod iso    year
#       <dbl>        <int>         <dbl> <fct> <int>
#1    121533.        22885        43088. ALB    2003
#2    141111.        26666        50080. ALB    2004
#3    158103.        29850        56111. ALB    2005
#4    162150         27060        54920. ALB    2006
#5 295001206.     20882718    359029298. AUS    2003
#6 306451892.     24629022    369651781. AUS    2004
#7 321322290      27468018    376954911  AUS    2005
#8 335553893.     28872882    367358991  AUS    2006
Psidom
  • 209,562
  • 33
  • 339
  • 356
3

Doing this kind of manipulation in the tidyverse is best done by making sure your data is tidy by reshaping it. A purrr approach is also possible but is likely reliant on the order of your columns, which might not always be reliable. Instead, you can do the following:

  1. gather up all your measure columns
  2. mutate a new column measure_type that indicates whether it is a count or price, and remove the _price from crime_type. Now we have separate columns for the type of crime and the metric we are using for that crime. Each row is a single iso-year-crime-metric combination.
  3. spread the crime types back out so now we have separate count and price columns for all crimes, and then multiply with mutate.
  4. (optional) if you want to put it back in your wide format, we just gather up count and price and our new product column, unite to combine with the crime type and spread back out.
library(tidyverse)
tbl <- read_table2(
"iso    year  theft robbery burglary theft_price robbery_price burglary_price
ALB    2003   3694     199      874        32.9          115           49.3
ALB    2004   3694     199      874        38.2          134           57.3
ALB    2005   3694     199      874        42.8          150           64.2
ALB    2006   3450     164      779        47.0          165           70.5
AUS    2003 722334   14634   586266       408.4         1427          612.4
AUS    2004 636717   14634   512551       481.3         1683          721.2
AUS    2005 598700   14634   468558       536.7         1877          804.5
AUS    2006 594111   14634   433974       564.8         1973          846.5"
)
tidy_tbl <- tbl %>%
  gather(crime_type, measure, -iso, - year) %>%
  mutate(
    measure_type = if_else(str_detect(crime_type, "_price$"), "price", "count"),
    crime_type = str_remove(crime_type, "_price")
    ) %>%
  spread(measure_type, measure) %>%
  mutate(product = count * price)
tidy_tbl
#> # A tibble: 24 x 6
#>    iso    year crime_type count price product
#>    <chr> <int> <chr>      <dbl> <dbl>   <dbl>
#>  1 ALB    2003 burglary     874  49.3  43088.
#>  2 ALB    2003 robbery      199 115    22885 
#>  3 ALB    2003 theft       3694  32.9 121533.
#>  4 ALB    2004 burglary     874  57.3  50080.
#>  5 ALB    2004 robbery      199 134    26666 
#>  6 ALB    2004 theft       3694  38.2 141111.
#>  7 ALB    2005 burglary     874  64.2  56111.
#>  8 ALB    2005 robbery      199 150    29850 
#>  9 ALB    2005 theft       3694  42.8 158103.
#> 10 ALB    2006 burglary     779  70.5  54920.
#> # ... with 14 more rows

tidy_tbl %>%
  gather(measure_type, measure, count:product) %>% 
  unite("colname", crime_type, measure_type) %>%
  spread(colname, measure)
#> # A tibble: 8 x 11
#>   iso    year burglary_count burglary_price burglary_product robbery_count
#>   <chr> <int>          <dbl>          <dbl>            <dbl>         <dbl>
#> 1 ALB    2003            874           49.3           43088.           199
#> 2 ALB    2004            874           57.3           50080.           199
#> 3 ALB    2005            874           64.2           56111.           199
#> 4 ALB    2006            779           70.5           54920.           164
#> 5 AUS    2003         586266          612.        359029298.         14634
#> 6 AUS    2004         512551          721.        369651781.         14634
#> 7 AUS    2005         468558          804.        376954911          14634
#> 8 AUS    2006         433974          846.        367358991          14634
#> # ... with 5 more variables: robbery_price <dbl>, robbery_product <dbl>,
#> #   theft_count <dbl>, theft_price <dbl>, theft_product <dbl>

Created on 2018-08-15 by the reprex package (v0.2.0).

Calum You
  • 14,687
  • 4
  • 23
  • 42
  • Ryan's comment to my question already solved my problem, but I'd like to understand your answer. Can you explain why your solution uses tidy data? My understanding is that data is tidy if each row contains one observational unit (in my case country-year) and each column a variable (my my case crime types and its prices). Was my starting data not tidy? – avs Aug 15 '18 at 20:45
  • the definition of tidy depends on what your proposed analysis or calculation is. a country-year is a reasonable unit here but based on your question you are interested in abstract counts and prices and products, with the crime type as a separate variable. e.g. if you wanted to plot the product by crime_type and year, the shape where a row is a country-year-crime_type would be the best – Calum You Aug 15 '18 at 20:49
  • In this case I just need the crime proceeds per crime and the total for all crimes, so the sums of all the `_prod` columns (already figured out how). Is there a disadvantage of reshaping data multiple times between long and wide format in terms of computing power? In the past I've used Stata and reshaping could take a long time if the data set was very large. – avs Aug 16 '18 at 07:03
  • I think again in this situation with the wide data you would need to do `sum()` on each `prod` column with `summarise_at` for total by crime, and then sum again for the total of all crimes. Personally I feel it is more natural to instead just do `group_by(crime_type) %>% summarise(sum = sum(product))`, i.e. don't do the part where you spread back out to wide. so you are only really reshaping once and then you keep it there. – Calum You Aug 16 '18 at 17:35
  • for computing power, I would only optimise on that if it is actually taking too long to run (don't optimise prematurely) because it usually takes longer to figure out what to do than for the computer to run the code. e.g say it takes 1 sec longer to do the reshape, but it takes 5 secs to figure out that you must edit `vars` with ryan's answer if the number of crime types changes, and also deal with any missing combinations of crimetype_measure. – Calum You Aug 16 '18 at 17:38