4

Im kinda newbie in R right now... So im doing a census research as an university project. for illustration that is part of my data.frame

             MUN          X1990  X1991  X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ)  12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA

My problem is that i need to sum some municipalities rows whose name i know/will specify, (because i dont know the order it will apear, or if they ever will apear, in all of my tables), and the result should be displayed in a row.

as an example, i would like to sum the row "Areal" with the row "Angra dos Reis", with the result stored in another created row( let's call the result row: X) so the result should be:

             MUN          X1990  X1991  X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ)  12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA
6          X                 11    10    10    13

I have tried to create a for loop and an if loop, but i can't get to do that right.

tnorio
  • 61
  • 1
  • 7
  • `rbind(mydf, data.frame(MUN = 'X', as.data.frame.list(colSums(mydf[c(1,4), -1], na.rm = TRUE))))` – Jaap Oct 23 '17 at 18:23
  • thanks bro, it seems to work. but could you explain me the code? especially that part `as.data.frame.list(colSums(mydf[c(1,4), -1]` colSums? and why the c(1,4), -1? -1, why is that? – tnorio Oct 23 '17 at 19:36
  • to see what it does, disect from inside out: first see what `mydf[c(1,4), -1]` does, then wrap it in `colSums`, etc – Jaap Oct 23 '17 at 20:05
  • ooh, i got it, thank you. but the thing is i dont know the row that the municipalities i want to sum will apear,or if they ever will, so if i dont count the 1st colum `mydf[c(1,4), -1]` i can't call them by name. and if i take the `-1` out, i get an error :**x must be numeric**. i tried to change `as.data.frame.numeric` but it doesnt work – tnorio Oct 23 '17 at 23:17

4 Answers4

5

This is pretty similar to Jaap's comment, but a little more spelled out and uses the row names explicitly:

mat = as.matrix(dat[, 2:5])
row.names(mat) = dat$MUN
mat = rbind(mat, colSums(mat[c("Angra dos Reis (RJ)", "Areal (RJ)"), ], na.rm = T))
row.names(mat)[nrow(mat)] = "X"
mat
#                         X1990 X1991 X1992 X1993
# Angra dos Reis (RJ)        11    10    10    10
# Aperibé (RJ)               NA    NA    NA    NA
# Araruama (RJ)           12040 14589 14231 14231
# Areal (RJ)                 NA    NA    NA     3
# Armação dos Búzios (RJ)    NA    NA    NA    NA
# X                          11    10    10    13

The result is a matrix, you can convert it back to a data frame if needed:

dat_result = data.frame(MUN = row.names(mat), mat, row.names = NULL)

I dislike the format of your data as a data frame. I would either convert it to a matrix (as above) or convert it to long format with, e.g., tidyr::gather(dat, key = year, value = value, -MUN) and work with it "by group" using data.table or dplyr.


Using this data:

dat = read.table(text = "             MUN          X1990  X1991  X1992 X1993
1     'Angra dos Reis (RJ)'    11    10    10    10
2            'Aperibé (RJ)'    NA    NA    NA    NA
3           'Araruama (RJ)'  12040 14589 14231 14231
4              'Areal (RJ)'    NA    NA    NA     3
5 'Armação dos Búzios (RJ)'    NA    NA    NA    NA", header= T)
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • i dont know much about matrix, but if i use one i wont be able to use characters and numbers right? as far as i know thats the diference between matrix and data frame. anyway i see what u did there, making a variable to the row names in the dataframe. i just didn't get how we are able to call the rows by name if you started the count by the second collum`mat = as.matrix(dat[, 2:5])`. Also if you could explain this last line for me `row.names(mat)[nrow(mat)] = "X"` – tnorio Oct 23 '17 at 23:37
  • You're right about what a matrix is - only one data type. But unless you have more columns, the way you are using the `MUN` feels more like a row name than a column. I was able to call the rows by name thanks to setting the names with `row.names(mat) = dat$MUN`. – Gregor Thomas Oct 24 '17 at 04:38
  • The last bit, `row.names(mat)[nrow(mat)] = "X"`, is just setting the name of the last row to "X". `nrow(mat)` is the number of rows - that is, the number of the last row. – Gregor Thomas Oct 24 '17 at 04:40
2

A solution can be using sqldf package. If the name of the data frame is df, you can do it likes the following:

library(sqldf)
result <- sqldf("SELECT * FROM df UNION 
       SELECT 'X', SUM(X1990), SUM(X1991), SUM(X1992), SUM(X1993) FROM df
       WHERE MUN IN ('Angra dos Reis (RJ)', 'Areal (RJ)')")
OmG
  • 18,337
  • 10
  • 57
  • 90
2

Here is a dplyr solution:

library(dplyr)
df %>%
  filter(MUN %in% c("Angra dos Reis (RJ)", "Areal (RJ)")) %>%
  summarize_if(is.numeric, sum, na.rm = TRUE) %>%
  as.list(.) %>%
  c(MUN = "X") %>%
  bind_rows(df, .)

Result:

                      MUN X1990 X1991 X1992 X1993
1     Angra dos Reis (RJ)    11    10    10    10
2            Aperibé (RJ)    NA    NA    NA    NA
3           Araruama (RJ) 12040 14589 14231 14231
4              Areal (RJ)    NA    NA    NA     3
5 Armação dos Búzios (RJ)    NA    NA    NA    NA
6                       X    11    10    10    13

Data (from @Gregor with stringsAsFactors = FALSE):

df = read.table(text = "             MUN          X1990  X1991  X1992 X1993
                 1     'Angra dos Reis (RJ)'    11    10    10    10
                 2            'Aperibé (RJ)'    NA    NA    NA    NA
                 3           'Araruama (RJ)'  12040 14589 14231 14231
                 4              'Areal (RJ)'    NA    NA    NA     3
                 5 'Armação dos Búzios (RJ)'    NA    NA    NA    NA", header= T, stringsAsFactors = FALSE)
acylam
  • 18,231
  • 5
  • 36
  • 45
0

I have assumed that you would like to sum data of two municipalites whose names you know/specify and then add their sum at the end of the table. I was not sure if this understanding is correct. You might need to specify your question again in case below code is not what you need (e.g., concerning if you need to sum multiple municipalities each time or only two at a time, etc.)

Furthermore, if you have to call the function I proposed a large number of times or your table is really large, it needs to be improved in terms of speed, e.g., by using the package data.table instead of base R (since you said you are a beginner I sticked to base R).

To fulfull your request of keeping NA values where possible I have used code proposed by Joshua Ulrich as answer to this question rowSums but keeping NA values.

data <- data.frame(MUN = c("Angra dos Reis (RJ)", "Aperibé (RJ)", "Araruama (RJ)", "Areal (RJ)", "Armação dos Búzios (RJ)")
               ,X1990 = c(11, NA, 12040, NA, NA)
               ,X1991 = c(10, NA, 14589, NA, NA)
               ,X1992 = c(10, NA, 14231, NA, NA)
               ,X1993 = c(10, NA, 12231, 3, NA)
)

sum_rows <- function(df, row1, row2) {

  #get the indices of the two rows to be summed
  #grep returns the position in a vector at which a certain element is stored
  #here the name of the municipality 
  index_row1 <-  grep(row1, df$MUN, fixed=T)
  index_row2 <-  grep(row2, df$MUN, fixed=T)

  #select the two rows of the data.frame that you want to sum
  #on basis of the entry in the MUN column
  #further only select the column with numbers for the sum operation
  #check if all entries in a single column are NA values
  #if yes then the ouput for this column is NA
  #if no calculate the column sum, if one entry is NA, ignore it
  sum <- ifelse(apply(is.na(df[c(index_row1, index_row2),2:ncol(df)]),2,all)
                      ,NA
                      ,colSums(df[c(index_row1, index_row2),2:ncol(df)],na.rm=TRUE)
               )

  #create a name entry for the new MUN column
  #paste0 is used to combine strings
  #in this case it might make sense to create a name 
  #that includes the indices of the rows that have been summed instad of only using X as name
  name <- paste0("Sum_R",index_row1,"_R" , index_row2)

  #add the row to the original data.frame
  df <-  cbind(MUN = c(as.character(df$MUN), name)
               ,rbind(df[, 2:ncol(df)], sum)
              )

  #return the data.frame from the function
  df

} 

#sum two rows and replace your data.frame by the new result
data <- sum_rows(data, "Angra dos Reis (RJ)", "Areal (RJ)")

data <- sum_rows(data, "Armação dos Búzios (RJ)", "Areal (RJ)")
Manuel Bickel
  • 2,156
  • 2
  • 11
  • 22
  • thanks for your answer. You understood my question right, nice code, i understood most part of it. i just didnt get how the row name gets settled `name <- paste0("Sum_R", grep(row1, df$MUN, fixed=T),"_R" ,grep(row2, df$MUN, fixed=T))`. Also i need the NA values, because they differ from the 0 ones, and when i run the code without the part `data[ is.na(data)] <- 0` the output comes with an error. – tnorio Oct 24 '17 at 00:38
  • I updated my answer now addressing all your needs (including some explanations. I don't know why you got an error before, on my machine the code did not produce an error. Since the error causing line is gone now, this is solved. Please note that I agree with the data formatting issues highlighted by @Gregor. I just kept the `data.frame` format to demonstrate how it works (or how complicated it may be). Also note that the `dpylr`package highlightet by @useR might be worth a look for writing elgant code. – Manuel Bickel Oct 24 '17 at 08:21