How to group a dataframe, and then count the distinct values in R

Question

I have a dataframe in R that has 43 Variables and over 80 rows. I would like to group the data, based on one variable - geographical region, and then count the distinct values of a variable (How many 0s, 1s, 2s, 3s and NAs, etc).

I know the group_by and summarize functions in the tidyverse, and I understand I can use functions like "sum" and "mean", but I want to count

I tried est1 <- df %>% group_by(region) %>% summarize(count)

The data looks like this:

    iso3      Country WHOregion       WBIncomeGroup UrbanSanPol UrbanSanWom UrbanSanExt RuralSanPol RuralSanWom
  <chr>        <chr>     <chr>               <chr>       <chr>       <chr>       <chr>       <chr>       <chr>
1   AFG  Afghanistan      EMRO          Low income           0        <NA>        <NA>           1           1
2   ALB      Albania      EURO Upper middle income           1           0           0           1           0
3   ARG    Argentina      PAHO Upper middle income           1           0         0.5           1           0
4   AZE   Azerbaijan      EURO Upper middle income           1           1         0.5           1           1
5   BDI      Burundi      AFRO          Low income           1           1         0.5           1           1
6   BFA Burkina Faso      AFRO          Low income           1           1           1           1           1

But it's not what I want. Could anyone help please?

structure(list(iso3 = c("AFG", "ALB", "ARG", "AZE", "BDI", "BFA", 
"BGD", "BIH", "BLR", "BOL"), Country = c("Afghanistan", "Albania", 
"Argentina", "Azerbaijan", "Burundi", "Burkina Faso", "Bangladesh", 
"Bosnia and Herzegovina", "Belarus", "Bolivia (Plurinational State of)"
), WHOregion = c("EMRO", "EURO", "PAHO", "EURO", "AFRO", "AFRO", 
"SEARO", "EURO", "EURO", "PAHO"), WBIncomeGroup = c("Low income", 
"Upper middle income", "Upper middle income", "Upper middle income", 
"Low income", "Low income", "Lower middle income", "Upper middle income", 
"Upper middle income", "Lower middle income"), UrbanSanPol = c("0", 
"1", "1", "1", "1", "1", "1", "1", "1", "1"), UrbanSanWom = c(NA, 
"0", "0", "1", "1", "1", "1", "0", NA, "0"), UrbanSanExt = c(NA, 
"0", "0.5", "0.5", "0.5", "1", "0.5", "0", "0.5", "0"), RuralSanPol = c("1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1"), RuralSanWom = c("1", 
"0", "0", "1", "1", "1", "1", "0", NA, "0"), RuralSanExt = c("0.5", 
"0", "0", "0.5", "0.5", "1", "0.5", "0", "0.5", "0.5"), UrbanDWPol = c("0", 
"1", "1", "1", "1", "1", "1", "1", "1", "1"), UrbanDWWom = c(NA, 
"0", "0", "1", "1", "1", "1", "0", NA, "0"), UrbanDWExt = c(NA, 
"0", "0.5", "1", "0", "0.5", "0.5", "0.5", "0.5", "0"), RuralDWPol = c("1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1"), RuralDWWom = c("1", 
"0", "0", "1", "1", "1", "1", "0", NA, "0"), RuralDWExt = c("0.5", 
"0", "0", "1", "0.5", "1", "0.5", "0.5", "0.5", "0.5"), HygienePol = c("1", 
"1", "0", "1", "1", "1", "1", "1", "1", "0"), HygieneWom = c("1", 
NA, NA, "1", "1", "1", "1", "0", NA, "0"), HygieneExt = c("0.5", 
NA, NA, "0", "0.5", "0", "0.5", "0", "0.5", "0"), WASHHealthPol = c("1", 
"1", "0", "1", "1", "1", "1", "1", "0", "0"), WASHHealthWom = c("0", 
NA, NA, "1", "1", "1", "1", "0", NA, "0"), WASHHealthExt = c("0", 
NA, "0.5", "1", "0", "0.5", "0", "0", NA, "0"), WpollutionPol = c("1", 
"1", "1", "1", "1", "1", "1", "1", "1", "0"), WpollutionWom = c("1", 
NA, "0", "1", "1", "1", "1", "0", NA, "0"), WpollutionExt = c("0", 
NA, "0", "1", "0", "0.5", "0", "0", "0.5", "0"), WQMPol = c("1", 
"1", "1", "1", "1", "1", "1", "1", "1", "0"), WQMWom = c("1", 
NA, "0", "1", "1", "1", "1", "0", NA, "0"), WQMExt = c("0", NA, 
"0", "1", "0", "0.5", "0", "0", "0.5", "0"), WatRightPol = c("0", 
"1", "1", "1", NA, "1", "1", "1", "1", "1"), WatRightWom = c("0", 
NA, "0", "1", NA, "1", "1", "0", NA, "0"), WatRightExt = c("0", 
NA, "0.5", "1", NA, "1", "0", "0", "0.5", "0.5"), WRMPol = c("1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1"), WRMWom = c("0", 
NA, "0", "1", "1", "1", "1", "0", NA, "0"), WRMExt = c("0", NA, 
"0.5", "1", "0.5", "1", "0", "0", "0.5", "0"), EnvProtPol = c("1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1"), EnvProtWom = c("0", 
NA, "0", "1", "1", "1", "1", "0", NA, "0"), EnvProtExt = c("0", 
NA, "0", "1", "0", "1", "0", "0", "0.5", "0"), `SDG regions` = c("Central Asia (M49) and Southern Asia (MDG=M49)", 
"Northern America (M49) and Europe (M49)", "Latin America and the Caribbean (MDG=M49)", 
"Western Asia (M49) and Northern Africa (M49)", "Sub-Saharan Africa (M49)", 
"Sub-Saharan Africa (M49)", "Central Asia (M49) and Southern Asia (MDG=M49)", 
"Northern America (M49) and Europe (M49)", "Northern America (M49) and Europe (M49)", 
"Latin America and the Caribbean (MDG=M49)"), M49_level1 = c("Asia (M49)", 
"Europe (M49)", "Latin America and the Caribbean (MDG=M49)", 
"Asia (M49)", "Sub-Saharan Africa (M49)", "Sub-Saharan Africa (M49)", 
"Asia (M49)", "Europe (M49)", "Europe (M49)", "Latin America and the Caribbean (MDG=M49)"
), M49_level2 = c("Southern Asia (MDG=M49)", "Southern Europe (M49)", 
"South America (M49)", "Western Asia (M49)", "Eastern Africa (M49)", 
"Western Africa (M49)", "Southern Asia (MDG=M49)", "Southern Europe (M49)", 
"Eastern Europe (M49)", "South America (M49)"), LDCs = c("Least Developed Countries (LDCs)", 
NA, NA, NA, "Least Developed Countries (LDCs)", "Least Developed Countries (LDCs)", 
"Least Developed Countries (LDCs)", NA, NA, NA), LLDCS_SIDS = c("Landlocked developing countries (LLDCs)", 
NA, NA, "Landlocked developing countries (LLDCs)", "Landlocked developing countries (LLDCs)", 
"Landlocked developing countries (LLDCs)", NA, NA, NA, "Landlocked developing countries (LLDCs)"
), `Income group` = c("Low income", "Upper middle income", "Upper middle income", 
"Upper middle income", "Low income", "Low income", "Lower middle income", 
"Upper middle income", "Upper middle income", "Lower middle income"
)), .Names = c("iso3", "Country", "WHOregion", "WBIncomeGroup", 
"UrbanSanPol", "UrbanSanWom", "UrbanSanExt", "RuralSanPol", "RuralSanWom", 
"RuralSanExt", "UrbanDWPol", "UrbanDWWom", "UrbanDWExt", "RuralDWPol", 
"RuralDWWom", "RuralDWExt", "HygienePol", "HygieneWom", "HygieneExt", 
"WASHHealthPol", "WASHHealthWom", "WASHHealthExt", "WpollutionPol", 
"WpollutionWom", "WpollutionExt", "WQMPol", "WQMWom", "WQMExt", 
"WatRightPol", "WatRightWom", "WatRightExt", "WRMPol", "WRMWom", 
"WRMExt", "EnvProtPol", "EnvProtWom", "EnvProtExt", "SDG regions", 
"M49_level1", "M49_level2", "LDCs", "LLDCS_SIDS", "Income group"
), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"

enter code here

Maybe look into `?table`? Also we can provide a much better response if you post some sample data with your variables of interest and several rows (not the full dataset, just what we need to reproduce and understand your issue) — Mike H., Jan 24 '18 at 15:11
Not sure how you want the groupings, but maybe something like this: `lapply(split(mtcars, mtcars$cyl), function(x) lapply(x, table))`? Replace `mtcars` with your data frame, and `$cyl` with `$region`. — Gregor Thomas, Jan 24 '18 at 15:19
To switch the nesting, `lapply(mtcars[-2], function(x) lapply(split(x, mtcars$cyl), table))`, where `2` is the column index of the grouping column. — Gregor Thomas, Jan 24 '18 at 15:23
Thanks, as you can see I've tried to add my dataframe, without success - how to copy and paste it into this box? — user1778351, Jan 24 '18 at 15:23
Thanks Gregor. Imagine I have three columns. The first is a list of countries (France, Germany, etc). The second is a list of regions (Asia, Europe), the third is a discrete value for each country (number of Olympic Gold medals). I want to group all the data by region, and count how many times for each region, 0 occurs, 1 occurs, 2 occurs. — user1778351, Jan 24 '18 at 15:25
Not 100% clear on what you are after, but it sounds like it could be accomplished using `count`, `tally` or `distinct` from `dplyr` - have a look at the help files for those functions. — konvas, Jan 24 '18 at 15:27
Please run this code with your df: dput(head(df,10)) and paste the output so we can run tests on example code — leeum, Jan 24 '18 at 15:28
The table you posted is not helpful - no one wants to type that out themselves. Please post a dput of your data. dput(head(df,10)) — leeum, Jan 24 '18 at 15:36
Absolutely trying to post the dput - however, it is not working! How to print the code — user1778351, Jan 24 '18 at 15:47

score 0 · Answer 1 · answered Jan 24 '18 at 16:01

Imagine I have three columns. The first is a list of countries (France, Germany, etc). The second is a list of regions (Asia, Europe), the third is a discrete value for each country (number of Olympic Gold medals). I want to group all the data by region, and count how many times for each region, 0 occurs, 1 occurs, 2 occurs.

From what you said in the comments, and from what I understood:

A interpretation:

df %>% select(continent,countries,medals) %>% group_by(continent,countries) %>% summarize(count =n())

Another interpretation: What you want is each continent and a count of distinct amount of medals won by it.

The numbers are the Olympic medals won by the country.

df <- as.data.frame(matrix(c("Asia","Asia","Asia","Asia","Europe","Europe","India","China","Bangladesh","Japan","Spain", "Italy",6,3,4,4,3,3),ncol = 3)) 

df %>% group_by(V1,V3) %>% summarise(count= n()) %>% spread(V3,count)

Gives me an output

# A tibble: 2 x 4
# Groups:   V1 [2]
      V1   `3`   `4`   `6`
* <fctr> <int> <int> <int>
1   Asia     1     2     1
2 Europe     2    NA    NA

By jingo! You have it, the second option is what I want. Thanks so much. I really really appreciate. I'm quite new to R and I'm slowly picking up tricks! — user1778351, Jan 24 '18 at 16:14

score -1 · Accepted Answer · answered Feb 05 '18 at 04:55

-1

Try this ; need dplyr and tidyverse

distinct_cnt <-  input_df  %>%
    gather(variable, value) %>%
    group_by(variable) %>%
    summarise(n_distinct(value))

answered Feb 05 '18 at 04:55

Saranga

154
1
3

How to group a dataframe, and then count the distinct values in R

2 Answers2