1

I'm working with census data and I need to calculate total values based on a selection of block groups (rows) for a large number of fields that are grouped together with similar starting patterns. Unfortunately these new calculations aren't all sums of rows within a column values but require recombinations of some of the sums from other columns. My main question is there a way to use pattern matching to do these sums over groups that start the same in R and either create a new row of these recalculated values or a single row data frame of these values.

GEOID        | TOTPOP_EST | TOTPOP_CV | TOTPOP_SE | TOTPOP_SESQD | TOTPOP_MOE | BLK_EST | BLK_CV | ...
180816114005 | 1301       | 0.0916    | 119.15    | 14196.47     | 196        | 184     | 0.2341 | ...
.
.
.
180816112003 | 802        | 0.1342    | 107.60    | 11577.50     | 177        | 2       | 3.8447 | ...
(calculated values & equations below)         
Total        | 12071      | 0.0557    | 627.30    | 451983.6     | 1106       | 1589    | 0.1685
TOTAL |sum(TOTPOP_EST)|sqrt(sum(TOTPOP_SESQD)/sum(TOTPOP_EST)|sqrt(sum(TOTPOP_SESQD)|sum(TOTPOP_SESQD)|sqrt(sum(TOTPOP_SESQD)/sum(TOTPOP_EST)*1.645|sum(BLK_EST)| ...

I want to do these same calculations for the sets that start with BLK, NATAM, ASN, etc... without having to manually write the equations out for each column. Any thoughts?

EDITS Here is the first couple of rows from dput...it's actually 60 columns total but I truncated it to the first 23 columns to get three full sets of values. The first four columns are informational about the block groups and there are two extra columns to be calculated that I didn't mention in my original post the _LB (lower bound) is just the summed _EST minus the calculated _MOE (margin of error) value while _UB (upper bound) is _EST + _MOE.

dput(head(st_drop_geometry(selected[0:23])))

structure(list(OBJECTID = c(132L, 487L, 351L, 168L, 169L, 240L
), BLKGRPID = c("1500000US180816114005", "1500000US180816112003", 
"1500000US180050111002", "1500000US180050111006", "1500000US180050111004", 
"1500000US180816113001"), GEOID = c("180816114005", "180816112003", 
"180050111002", "180050111006", "180050111004", "180816113001"
), NAME = c("Block Group 5, Census Tract 6114, Johnson County, Indiana", 
"Block Group 3, Census Tract 6112, Johnson County, Indiana", 
"Block Group 2, Census Tract 111, Bartholomew County, Indiana", 
"Block Group 6, Census Tract 111, Bartholomew County, Indiana", 
"Block Group 4, Census Tract 111, Bartholomew County, Indiana", 
"Block Group 1, Census Tract 6113, Johnson County, Indiana"), 
    TOTPOPRACE_EST = c(1301, 802, 2618, 2286, 1015, 996), TOTPOPRACE_CV = c(0.0915825796850213, 
    0.134163072561757, 0.109831166509157, 0.150779024962305, 
    0.272508122838277, 0.199582524627385), TOTPOPRACE_SE = c(119.148936170213, 
    107.598784194529, 287.537993920973, 344.68085106383, 276.595744680851, 
    198.784194528875), TOTPOPRACE_SESQD = c(14196.4689904934, 
    11577.4983601408, 82678.0979480973, 118804.889090086, 76505.2059755545, 
    39515.1559944938), TOTPOPRACE_MOE = c(196, 177, 473, 567, 
    455, 327), TOTPOPRACE_LB = c(1105, 625, 2145, 1719, 560, 
    669), TOTPOPRACE_UB = c(1497, 979, 3091, 2853, 1470, 1323
    ), HSPBIPOC_EST = c(291, 63, 727, 834, 138, 290), HSPBIPOC_CV = c(0.210690175299478, 
    0.703866308673204, 0.323402750899361, 0.311249757908781, 
    0.925363006422687, 0.589156134019651), HSPBIPOC_SE = c(61.3108410121482, 
    44.3435774464119, 235.113799903835, 259.582298095923, 127.700094886331, 
    170.855278865699), HSPBIPOC_SESQD = c(3759.01922561691, 1966.35286074593, 
    55278.4989052208, 67382.9694847608, 16307.3142339779, 29191.5263162757
    ), HSPBIPOC_MOE = c(101, 73, 387, 427, 210, 281), HSPBIPOC_LB = c(190, 
    0, 340, 407, 0, 9), HSPBIPOC_UB = c(392, 136, 1114, 1261, 
    348, 571), HISPANIC_SE = c(19.4528875379939, 31.6109422492401, 
    213.981762917933, 258.966565349544, 126.44376899696, 119.148936170213
    ), HISPANIC_SESQD = c(378.414833565839, 999.251669884794, 
    45788.1948614665, 67063.6819689397, 15988.0267181567, 14196.4689904934
    ), HISPANIC_MOE = c(32, 52, 352, 426, 208, 196), HISPANIC_LB = c(6, 
    0, 217, 408, 0, 0), HISPANIC_UB = c(70, 86, 921, 1260, 346, 
    329)), row.names = c(132L, 487L, 351L, 168L, 169L, 240L), class = "data.frame")
GuedesBF
  • 8,409
  • 5
  • 19
  • 37
Keren Crum
  • 11
  • 2
  • 2
    It would be great, if you could give use the first couple of rows with `dput(head(your_data_frame))` and also ideally the desired output for one group `EST`. Give the {dplyover} package a look, it can help with `over()` and `.()` as in this example: https://timteafan.github.io/dplyover/reference/string_eval.html – TimTeaFan Oct 03 '22 at 19:55

1 Answers1

0

I made the {dplyover} package for exactly this kind of use cases. We can use over() for this problem. We get the variable stems with cut_names() and then construct the full variable names with .():

library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/

dat %>% 
  summarise(
    over(cut_names("_EST$"), # <- get all variable stems for which there is an "_EST" ending
         list(EST_TOTAL = ~ sum(.("{.x}_EST")),
              SESQD_EST_SQRT  = ~ sqrt(sum(.("{.x}_SESQD"))/sum(.("{.x}_EST"))),
              SESQD_SQRT  = ~ sqrt(sum(.("{.x}_SESQD"))),
              SESQD_TOTAL = ~ sum(.("{.x}_SESQD")),
              SESQD_EST_SQRT2  = ~ sqrt(sum(.("{.x}_SESQD"))/sum(.("{.x}_EST"))) * 1.645
              )
         )
  ) %>% glimpse # for better printing on SO (delete this line)

#> Rows: 1
#> Columns: 10
#> $ TOTPOPRACE_EST_TOTAL       <dbl> 9018
#> $ TOTPOPRACE_SESQD_EST_SQRT  <dbl> 6.169748
#> $ TOTPOPRACE_SESQD_SQRT      <dbl> 585.8987
#> $ TOTPOPRACE_SESQD_TOTAL     <dbl> 343277.3
#> $ TOTPOPRACE_SESQD_EST_SQRT2 <dbl> 10.14924
#> $ HSPBIPOC_EST_TOTAL         <dbl> 2343
#> $ HSPBIPOC_SESQD_EST_SQRT    <dbl> 8.614811
#> $ HSPBIPOC_SESQD_SQRT        <dbl> 416.996
#> $ HSPBIPOC_SESQD_TOTAL       <dbl> 173885.7
#> $ HSPBIPOC_SESQD_EST_SQRT2   <dbl> 14.17136

OPs data

structure(list(OBJECTID = c(132L, 487L, 351L, 168L, 169L, 240L
), BLKGRPID = c("1500000US180816114005", "1500000US180816112003", 
                "1500000US180050111002", "1500000US180050111006", "1500000US180050111004", 
                "1500000US180816113001"), GEOID = c("180816114005", "180816112003", 
                                                    "180050111002", "180050111006", "180050111004", "180816113001"
                ), NAME = c("Block Group 5, Census Tract 6114, Johnson County, Indiana", 
                            "Block Group 3, Census Tract 6112, Johnson County, Indiana", 
                            "Block Group 2, Census Tract 111, Bartholomew County, Indiana", 
                            "Block Group 6, Census Tract 111, Bartholomew County, Indiana", 
                            "Block Group 4, Census Tract 111, Bartholomew County, Indiana", 
                            "Block Group 1, Census Tract 6113, Johnson County, Indiana"), 
TOTPOPRACE_EST = c(1301, 802, 2618, 2286, 1015, 996), TOTPOPRACE_CV = c(0.0915825796850213, 
                                                                        0.134163072561757, 0.109831166509157, 0.150779024962305, 
                                                                        0.272508122838277, 0.199582524627385), TOTPOPRACE_SE = c(119.148936170213, 
                                                                                                                                 107.598784194529, 287.537993920973, 344.68085106383, 276.595744680851, 
                                                                                                                                 198.784194528875), TOTPOPRACE_SESQD = c(14196.4689904934, 
                                                                                                                                                                         11577.4983601408, 82678.0979480973, 118804.889090086, 76505.2059755545, 
                                                                                                                                                                         39515.1559944938), TOTPOPRACE_MOE = c(196, 177, 473, 567, 
                                                                                                                                                                                                               455, 327), TOTPOPRACE_LB = c(1105, 625, 2145, 1719, 560, 
                                                                                                                                                                                                                                            669), TOTPOPRACE_UB = c(1497, 979, 3091, 2853, 1470, 1323
                                                                                                                                                                                                                                            ), HSPBIPOC_EST = c(291, 63, 727, 834, 138, 290), HSPBIPOC_CV = c(0.210690175299478, 
                                                                                                                                                                                                                                                                                                              0.703866308673204, 0.323402750899361, 0.311249757908781, 
                                                                                                                                                                                                                                                                                                              0.925363006422687, 0.589156134019651), HSPBIPOC_SE = c(61.3108410121482, 
                                                                                                                                                                                                                                                                                                                                                                     44.3435774464119, 235.113799903835, 259.582298095923, 127.700094886331, 
                                                                                                                                                                                                                                                                                                                                                                     170.855278865699), HSPBIPOC_SESQD = c(3759.01922561691, 1966.35286074593, 
                                                                                                                                                                                                                                                                                                                                                                                                           55278.4989052208, 67382.9694847608, 16307.3142339779, 29191.5263162757
                                                                                                                                                                                                                                                                                                                                                                     ), HSPBIPOC_MOE = c(101, 73, 387, 427, 210, 281), HSPBIPOC_LB = c(190, 
                                                                                                                                                                                                                                                                                                                                                                                                                                       0, 340, 407, 0, 9), HSPBIPOC_UB = c(392, 136, 1114, 1261, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                           348, 571), HISPANIC_SE = c(19.4528875379939, 31.6109422492401, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      213.981762917933, 258.966565349544, 126.44376899696, 119.148936170213
                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ), HISPANIC_SESQD = c(378.414833565839, 999.251669884794, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 45788.1948614665, 67063.6819689397, 15988.0267181567, 14196.4689904934
                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ), HISPANIC_MOE = c(32, 52, 352, 426, 208, 196), HISPANIC_LB = c(6, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            0, 217, 408, 0, 0), HISPANIC_UB = c(70, 86, 921, 1260, 346, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                329)), row.names = c(132L, 487L, 351L, 168L, 169L, 240L), class = "data.frame") ->
  dat

Created on 2022-10-04 by the reprex package (v2.0.1)

TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
  • 1
    Thanks Tim TeaFan!! Your dplyrover package did the trick!!! I did realize I wrote the equations wrong for CV and MOE in my initial post, but I fixed them in my code to the correct equations and it all worked like a charm. Thanks for building that and answering my question. – Keren Crum Oct 04 '22 at 16:09