2

I have a complex data frame and a minimal example is as follows:

df <- structure(list(District = c("Adilabad", "Adilabad", "Adilabad", 
                        "Adilabad", "Adilabad", "Adilabad", "Adilabad", "Adilabad", "Adilabad", 
                        "Adilabad"), Subdistt = c("Adilabad", "Adilabad", "Adilabad", 
                        "Tamsi", "Tamsi", "Tamsi", "Tamsi", "Tamsi", "Tamsi", "Tamsi"
                        ), TRU = c("Total", "Rural", "Urban", "Total", "Rural", "Urban", 
                        "Rural", "Rural", "Urban", "Urban"), Level = c("District", "District", 
                        "District", "Sub-District", "Sub-District", "Sub-District", "Village", 
                        "Village", "Town", "Town"), No_HH = c(1277, 364, 913, 
                        1277, 364, 913, 117, 247, 614, 299)), .Names = c("District", 
                        "Subdistt", "TRU", "Level", "No_HH"), row.names = c(NA, 10L), class = "data.frame")

which looks like this:

   District Subdistt   TRU        Level No_HH
1  Adilabad Adilabad Total     District  1277
2  Adilabad Adilabad Rural     District   364
3  Adilabad Adilabad Urban     District   913
4  Adilabad    Tamsi Total Sub-District  1277
5  Adilabad    Tamsi Rural Sub-District   364
6  Adilabad    Tamsi Urban Sub-District   913
7  Adilabad    Tamsi Rural      Village   117
8  Adilabad    Tamsi Rural      Village   247
9  Adilabad    Tamsi Urban         Town   614
10 Adilabad    Tamsi Urban         Town   299

Each subsequent column in a way is a kind of subset of the previous column. I have to validate if the sum of Sub-District and District at the Rural, Urban and Total level.

For eg: The sum of rows 7 and 8 is equal to the value in row 5. Row 5 is a Rural Sub-Distrit. As we extend the df, I have many rural sub-districts. The sum of all rural sub-districts is given in the Rural District, which is row 2.

A minimal expected output will be as follows:

  District Subdistt   TRU        Level No_HH
1 Adilabad    Tamsi Rural Sub-District   364
2 Adilabad    Tamsi Urban Sub-District   913

364 is a sum of 117 + 247 given in the minimal example above and 913 is the sum of sum of rows 614 + 299 given in the minimal example.

Currently I am able to subset to a particular value but don't know how to sum based on these complex selections. Can someone help?

Jaap
  • 81,064
  • 34
  • 182
  • 193
LeArNr
  • 635
  • 1
  • 6
  • 12
  • Can you show the expected output? – akrun Dec 13 '15 at 04:42
  • Are you trying to find `library(dplyr);df %>% filter(TRU!='Total') %>% group_by_(.dots=names(df)[1:2]) %>% summarise(No_HH= sum(No_HH))` – akrun Dec 13 '15 at 04:44
  • @akrun the expected output is the value of rows 1 through 6, wherein row 5 will be sum of 7 & 8 and row 5 will be sum of rows 9 & 10, row 4 will be sum of rows 5 & 6. Similar the value of row 1 will be sum of rows that has Total in TRU column and Sub-District in Level column. – LeArNr Dec 13 '15 at 05:09

1 Answers1

1

We can try

library(dplyr)
df %>%
    filter(Level=='Sub-District' & TRU != 'Total')
#  District Subdistt   TRU        Level No_HH
#1 Adilabad    Tamsi Rural Sub-District   364
#2 Adilabad    Tamsi Urban Sub-District   913

If we need to get the same output by summing,

df %>%
    filter(!grepl('District', Level)) %>% 
    group_by(District, Subdistt,TRU) %>%
    summarise(No_HH= sum(No_HH)) %>%
    mutate(Level= 'Sub_District')
#  District Subdistt   TRU No_HH        Level
#     (chr)    (chr) (chr) (dbl)        (chr)
# 1 Adilabad    Tamsi Rural   364 Sub_District
# 2 Adilabad    Tamsi Urban   913 Sub_District
akrun
  • 874,273
  • 37
  • 540
  • 662