Conditional mutate cumsum dplyr

Question

I have towns (from A to D), which have different populations, and are at different distances. The objective is to add up the total population living within the circle of radius (distance XY) where X is a town in the centre of the circle and Y any other town.

In this code:

    library(dplyr)

    Df <- structure(list(Town_From = c("A", "A", "A", "B", "B", "C"), Town_To = c("B", 
    "C", "D", "C", "D", "D"), Distance = c(10, 5, 18, 17, 20, 21)), .Names = c("Town_From", 
    "Town_To", "Distance"), row.names = c(NA, -6L), class = "data.frame")

    Df2 <- structure(list(Town = c("A", "B", "C", "D"), Population = c(1000, 
    800, 500, 200)), .Names = c("Town", "Population"), row.names = c(NA, 
    -4L), class = "data.frame")

    Df <- Df %>% left_join(Df2,by=c("Town_From"="Town")) %>% 
      left_join(Df2,by=c("Town_To"="Town"))%>%
      group_by(Town_From) %>% 
      arrange(Distance)
    colnames(Df)[4]<-c("pop_TF")
    colnames(Df)[5]<-c("pop_TT")
    Source: local data frame [6 x 5]
    Groups: Town_From [3]

      Town_From Town_To Distance pop_TF pop_TT
          <chr>   <chr>    <dbl>  <dbl>  <dbl>
    1         A       C        5   1000    500
    2         A       B       10   1000    800
    3         B       C       17    800    500
    4         A       D       18   1000    200
    5         B       D       20    800    200
    6         C       D       21    500    200

The towns have been grouped by `Town_From` and arranged by `Distance`.

Within the circle of 5 km radius (from A to C) live 1000 (in A) + 500 (in C) = 1500 people; within the next circle (radius A to B = 10 km) live 1500 + 800 (in B) = 2300. Within the third circle (radius 17 km, the distance from B to C) there are still 2300 people, because A, B and C are the only towns within 17 km of A. Within the circle of radius A to D = 18 km live 2300 + 200 (in D) = 2500 people.

Here is a visualization of the circles in question. In theory, the circles could expand to any arbitrary radius. In practice, I only need to check them at the distances between pairs of towns (places where the counts change).

Is your objective to compute the sum of the population as a function of the distance from each town (center of circle)? If so, then we can (i) group by each `Town_From`, (ii) sort each of these by `Distance`, and then (iii) compute the `cumsum`. — aichao, Jan 18 '17 at 13:36
Given the answer from @aichao, it is clear there is some ambiguity in your question. Where are your circles centered? I interpreted that each town should be the center of its own set of circles. @aichao seems to have worked directly from the format of the data you created. Your answer seems to conflate the circle centered at A with radius of 17km with the distance from B to C (also 17km), while if the towns were in a line, (A to B) + (B to C) could be > 17km. This reading implies that you want to include any city that is within Xkm of any other city (not necessarily within a single circle). — Mark Peterson, Jan 18 '17 at 14:48
Hello @aichao, thanks for asking. This question is very similar to one you answered before that was marked as solved, so I took the same data used in that question. The difference is that there, some of the towns within a given distance were not added to the total population when using cumsum. That is exactly what I need to avoid here, and I think Mark has found the way to do so. Thanks aichao! — JPV, Jan 19 '17 at 00:55
Of course C is excluded in the circle centered at B with radius 10km -- it is 17km away from B. Imagine this arrangement `C-A-B` with each `-` being 4km. From A, all three cities are within 5km. But from B, only A and B are. (there is a separate issue that the made up distances in your example data don't quite reflect a possible reality.) If you want to ask a new question, do that. Don't change what you are asking for and un-accept an answer. — Mark Peterson, Jan 20 '17 at 01:30
Here is a map illustrating my point: http://i.imgur.com/ZpTUVER.png . Note that all of the circles have the same radius, but while the circle centered at A contains A, B, and C, the circles centered at B and C each contain only A and themselves. If you want something other than those circles (which is what your original question asked for), draw it on this map, ask a new question, and revert this one. (Note that this map arrangement is as close as possible to the pairwise distances in your original post. The B-C distance of 17 is not possible as B-A + A-C is 10 + 5 = 15.) — Mark Peterson, Jan 20 '17 at 13:08
I rolled this back to remove the unexplained change from the OP and to add a description of the circle behavior that matches the original. I tried to get OP to explain, but ze disappeared for the past 3 weeks. — Mark Peterson, Feb 08 '17 at 12:49

Mark Peterson · Answer 1 · 2017-01-19T02:47:55.623

For this, it is easier to put your data into a format where each town appears on each "end" of the distance (both the "to" and the "from"). So, instead of the joins you applied to Df at the end, I did the following, starting from the original Df (before the joins). Note that it uses complete from tidyr.

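    library(dplyr)
    library(tidyr)  # for complete()

    Df_full <-
      Df %>%
      # append a copy with the to/from columns swapped
      bind_rows(
        select(Df, Town_From = Town_To, Town_To = Town_From, Distance)
      ) %>%
      # add each town as its own destination, at distance 0
      complete(Town_From, Town_To, fill = list(Distance = 0)) %>%
      # attach the population of the destination town
      left_join(Df2, by = c("Town_To" = "Town"))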

This reverses the to-from relationship and appends it to the bottom of the list. Then, it uses complete to add the town as its own "To" (e.g., From A to A). Finally, it joins the populations in, but they now only need to be added once. Here is the new data:

    # A tibble: 16 × 4
       Town_From Town_To Distance Population
           <chr>   <chr>    <dbl>      <dbl>
    1          A       A        0       1000
    2          A       B       10        800
    3          A       C        5        500
    4          A       D       18        200
    5          B       A       10       1000
    6          B       B        0        800
    7          B       C       17        500
    8          B       D       20        200
    9          C       A        5       1000
    10         C       B       17        800
    11         C       C        0        500
    12         C       D       21        200
    13         D       A       18       1000
    14         D       B       20        800
    15         D       C       21        500
    16         D       D        0        200
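
For readers without tidyr, the same three steps (reverse the pairs, add each town as its own destination, join the populations) can be sketched in base R. This is only an illustration, not the code used in this answer: the names `reversed`, `self` and `Df_full_base` are made up for the example, and the self-pairs are built from `Df2$Town` instead of with `complete`.

```r
# Data copied from the question
Df <- data.frame(
  Town_From = c("A", "A", "A", "B", "B", "C"),
  Town_To   = c("B", "C", "D", "C", "D", "D"),
  Distance  = c(10, 5, 18, 17, 20, 21),
  stringsAsFactors = FALSE
)
Df2 <- data.frame(
  Town       = c("A", "B", "C", "D"),
  Population = c(1000, 800, 500, 200),
  stringsAsFactors = FALSE
)

# 1. Reverse every pair so each town appears on both ends
reversed <- data.frame(
  Town_From = Df$Town_To,
  Town_To   = Df$Town_From,
  Distance  = Df$Distance,
  stringsAsFactors = FALSE
)

# 2. Add each town as its own destination at distance 0
#    (hypothetical replacement for the complete() step)
self <- data.frame(
  Town_From = Df2$Town,
  Town_To   = Df2$Town,
  Distance  = 0,
  stringsAsFactors = FALSE
)

# 3. Stack the three pieces and attach the destination town's population
Df_full_base <- merge(rbind(Df, reversed, self), Df2,
                      by.x = "Town_To", by.y = "Town")

# Restore the column and row order used in the table above
Df_full_base <- Df_full_base[
  order(Df_full_base$Town_From, Df_full_base$Town_To),
  c("Town_From", "Town_To", "Distance", "Population")
]
rownames(Df_full_base) <- NULL
Df_full_base
```

The result should have one row per ordered pair of towns (4 x 4 = 16 here), which is a quick way to check that no pair was lost.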

Next, we set the thresholds we want to explore. In your question, you imply that you want to use each of the unique pair-wise distances. If you prefer some other set for your production use, just enter them here.

    radiusCuts <-
      Df_full$Distance %>%
      unique() %>%
      sort()

Then, for each radius, we construct a sum expression (as a string) that adds up only the towns within that radius, setting the names in the process to ease the use of summarise_ in a moment.

    forPops <-
      radiusCuts %>%
      setNames(paste("Pop within", ., "km")) %>%
      lapply(function(x){
        # one expression string per radius, e.g. "sum(Population[Distance <= 5 ])"
        paste("sum(Population[Distance <=", x,"])")
      })

Finally, we group_by Town_From and pass those constructed arguments to the standard evaluation function summarise_, which creates one column per entry in forPops. (summarise_ has been deprecated since dplyr 0.7.0, so newer versions may emit a warning.)

    Df_full %>%
      group_by(Town_From) %>%
      summarise_(.dots = forPops)

gives:

    # A tibble: 4 × 8
      Town_From `Pop within 0 km` `Pop within 5 km` `Pop within 10 km` `Pop within 17 km` `Pop within 18 km` `Pop within 20 km` `Pop within 21 km`
          <chr>             <dbl>             <dbl>              <dbl>              <dbl>              <dbl>              <dbl>              <dbl>
    1         A              1000              1500               2300               2300               2500               2500               2500
    2         B               800               800               1800               2300               2300               2500               2500
    3         C               500              1500               1500               2300               2300               2300               2500
    4         D               200               200                200                200               1200               2000               2500

Which should give you all the thresholds you want.
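
If you would rather not build expression strings, the same table can be sketched in base R with a symmetric distance matrix. This is a different technique from the one in this answer, shown only as an illustration; the names `pairs`, `d`, `radii` and `pop_within` are made up for the example.

```r
towns <- c("A", "B", "C", "D")
pop   <- c(A = 1000, B = 800, C = 500, D = 200)

# Pairwise distances from the question (each pair listed once)
pairs <- data.frame(
  from = c("A", "A", "A", "B", "B", "C"),
  to   = c("B", "C", "D", "C", "D", "D"),
  km   = c(10, 5, 18, 17, 20, 21),
  stringsAsFactors = FALSE
)

# Symmetric distance matrix with zeros on the diagonal;
# a two-column character matrix indexes by row and column name
d <- matrix(0, length(towns), length(towns), dimnames = list(towns, towns))
d[cbind(pairs$from, pairs$to)] <- pairs$km
d[cbind(pairs$to, pairs$from)] <- pairs$km

# Radii to evaluate: every distinct distance, including 0
radii <- sort(unique(as.vector(d)))

# For each radius, walk over the rows of (d <= r); each row flags the towns
# inside the circle centred on that row's town, and we sum their populations
pop_within <- sapply(radii, function(r) {
  apply(d <= r, 1, function(inside) sum(pop[towns][inside]))
})
colnames(pop_within) <- paste("Pop within", radii, "km")
pop_within
```

Each row is a centre town and each column a radius, so the matrix should agree with the tibble above.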

Hi @Mark Peterson, thanks for your answer. This looks like the result I am looking for. Before marking it as solved: is the function complete, which you used to generate the data, part of the dplyr package? R consistently tells me it cannot find it. My apologies if this is too basic, but I am not a daily user of R. Thank you! EDIT: I found it. complete is a function of the tidyr package. — JPV, Jan 19 '17 at 01:25
Good catch, and sorry I missed the `tidyr` dependency. I generally load `tidyverse` which has several of those packages automatically load. Edited now. — Mark Peterson, Jan 19 '17 at 02:49
Thanks, Mark. May I ask you one last thing? If I want to discount the population in the "Town_From" and in the "Town_To" from the population within distance x, do I have to do that within the `sum` command? — JPV, Jan 19 '17 at 03:57
I'm not sure of your meaning. If you mean you want to exclude the population of the town at which your circle is centered, leave out the `complete` step. If you want to exclude the town on the edge of the radius (e.g., town B from A at 10 km), change `<=` to just `<` in `forPops`. If it is something more complicated, you may want to ask a new question. — Mark Peterson, Jan 19 '17 at 04:21
It is exactly that, but excluding both at the same time: the town at which the circle is centered and the one on the edge. Your recommendation worked perfectly. Thanks! — JPV, Jan 19 '17 at 04:26
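
The two exclusions discussed in these comments can be sketched as one small base R helper. This is an illustration only: `pop_inside` and its arguments are hypothetical names, and dropping the centre with `distances > 0` assumes no other town sits at distance 0 from it.

```r
# Sum the population inside a circle, optionally leaving out the centre town
# (distance 0) and any town lying exactly on the edge (distance == radius)
pop_inside <- function(distances, populations, radius,
                       include_centre = TRUE, include_edge = TRUE) {
  keep <- if (include_edge) distances <= radius else distances < radius
  if (!include_centre) {
    keep <- keep & distances > 0  # assumes only the centre is at distance 0
  }
  sum(populations[keep])
}

# Distances from town A and the populations, taken from the question
dist_A <- c(A = 0, B = 10, C = 5, D = 18)
pop    <- c(A = 1000, B = 800, C = 500, D = 200)

pop_inside(dist_A, pop, radius = 10)                          # A, B and C
pop_inside(dist_A, pop, radius = 10, include_centre = FALSE)  # B and C
pop_inside(dist_A, pop, radius = 10, include_edge = FALSE)    # A and C
pop_inside(dist_A, pop, radius = 10,
           include_centre = FALSE, include_edge = FALSE)      # C only
```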

score 1 · Answer 2 · answered Jan 18 '17 at 14:24

If your objective is to compute the sum of the population as a function of increasing distance from each town (at the center of the circle), then we can (i) group by Town_From, (ii) sort each of these groups by Distance, and then (iii) compute the cumsum. Using dplyr:

    library(dplyr)
    res <- Df %>%
      group_by(Town_From) %>%
      arrange(Distance) %>%
      mutate(sumPop = pop_TF + cumsum(pop_TT))

Using your data, the result is:

    print(res)
    ##Source: local data frame [6 x 6]
    ##Groups: Town_From [3]
    ##
    ##  Town_From Town_To Distance pop_TF pop_TT sumPop
    ##      <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
    ##1         A       C        5   1000    500   1500
    ##2         A       B       10   1000    800   2300
    ##3         B       C       17    800    500   1300
    ##4         A       D       18   1000    200   2500
    ##5         B       D       20    800    200   1500
    ##6         C       D       21    500    200    700

aichao

Should the circle from C to D of 21 km not also include the populations of A and B (which are 5 and 17 km from C, respectively)? – Mark Peterson Jan 18 '17 at 14:43
@MarkPeterson Yes, to do this correctly, the input data must reflect symmetry in the sense that there should be data from C to B and from C to A. Then the above code will work as expected. That is, there is nothing wrong with the logic of the code; it is the input data that has to be correct. Another viewpoint is that the data is putting a constraint on computing the total population from C by omitting the data from C to B and A. If you strongly disagree with this view, I will delete. – aichao Jan 18 '17 at 14:48
Ok, then I think we agree more than I thought on my first read of your question. If I apply the logic of your code to the symmetrical data I generated in my answer (e.g., `Df_full %>% group_by(Town_From) %>% arrange(Town_From, Distance) %>% mutate(sumPop=cumsum(Population))` ) it does give each of those cutoffs correctly (arranging by `Town_From` is just for display). However, you would run into some problems if there are two towns that are equidistant from a single town (e.g., if town E is also 18 km from A). – Mark Peterson Jan 18 '17 at 14:59
@MarkPeterson Yes I agree, then using `cumsum` itself is the issue. – aichao Jan 18 '17 at 15:01
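
The equidistant-town problem raised in these comments can be illustrated with a small base R sketch. Town E and its population of 300 are invented for the example; the other values are the distances from A in the question. The column names `cumsum_pop` and `within_pop` are also hypothetical.

```r
# Towns as seen from A; E is a made-up town that ties with D at 18 km
from_A <- data.frame(
  Town_To    = c("A", "C", "B", "D", "E"),
  Distance   = c(0, 5, 10, 18, 18),
  Population = c(1000, 500, 800, 200, 300),
  stringsAsFactors = FALSE
)
from_A <- from_A[order(from_A$Distance), ]

# Plain cumsum: the two rows at 18 km end up with different totals,
# although both describe the same circle
from_A$cumsum_pop <- cumsum(from_A$Population)

# Tie-safe version: for each row, sum every town at or below that distance
from_A$within_pop <- sapply(
  from_A$Distance,
  function(r) sum(from_A$Population[from_A$Distance <= r])
)

from_A
```

With ties, only the last row of each tied group carries the correct `cumsum` value, whereas the `Distance <= r` comparison gives every tied row the same total.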
