2

I have a dataframe with multiple subjects (company), year, personalname and gender (Female,Male). I want to obtain the year personalname changes (if there is a change). Also, if a change occurs during that specific year, I would like to create two binary variables: "FemaletoMale" (and MaletoFemale) indicating that the change occurs from female to male (or male to female).


So, if I had a table like

companyid year   personalname gender 
 1         1990  Alison       Female
 1         1991  Alison       Female
 1         1992  Kate         Female
 1         1993  Kate         Female
 2         1990  George       Male
 2         1991  Kate         Female
 2         1992  Kate         Female
 3         1990  Michael      Male
 3         1991  Dwight       Male

I am aware of that question that helps me to count the number of changes: How to tell if a value changed over dimensions in R

df<- df %>% group_by(companyid) %>% summarise(ChangeYear = sprintf("%s to %s", min(year), max(year)), change.count = length(unique(personalname)) - 1) This gives me the number of changes. What I wanted to see is;

companyid  change.count  changeyear  FemaletoMale MaletoFemale
 1               1             1992         0            0          
 2               1             1991         0            1
 3               1             1991         0            0
convex895
  • 25
  • 4

2 Answers2

1

Using dplyr you can do :

library(dplyr)

df %>%
  group_by(companyid) %>%
  summarise(change.count = n_distinct(personalname) - 1, 
            changeyear = year[personalname != lag(personalname, default = first(personalname))], 
            FemaletoMale = sum(gender == 'Male' & lag(gender) == 'Female', na.rm = TRUE),
            MaletoFemale = sum(gender == 'Female' & lag(gender) == 'Male', na.rm = TRUE))


#  companyid change.count changeyear FemaletoMale MaletoFemale
#      <int>        <dbl>      <int>        <int>        <int>
#1         1            1       1992            0            0
#2         2            1       1991            0            1
#3         3            1       1991            0            0
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Does this work:

library(dplyr)
df %>% group_by(companyid) %>% 
  summarise(change.count = n_distinct(personalname)-1, 
         changeyear = year[personalname != lag(personalname)][2], 
         FemaletoMale = case_when(gender[1] == 'Female' & gender[n()] == 'Male' ~ 1, TRUE ~ 0),
         MaletoFemale = case_when(gender[1] == 'Male' & gender[n()] == 'Female' ~ 1, TRUE ~ 0))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 5
  companyid change.count changeyear FemaletoMale MaletoFemale
      <dbl>        <dbl>      <dbl>        <dbl>        <dbl>
1         1            1       1992            0            0
2         2            1       1991            0            1
3         3            1       1991            0            0
Karthik S
  • 11,348
  • 2
  • 11
  • 25
  • This is exactly what I was looking for, thanks. One quick follow-up question, changeyear gives me NA's even when change.count is >0. Is there a solution for that? Thanks again. – convex895 Nov 22 '20 at 07:22
  • 1
    @convex895, are you selecting the second occurance using [2]? First value in the vector will be NA as there won't the a 'lag' for the first row for each group. – Karthik S Nov 22 '20 at 07:34
  • yes I am selecting the second occurrence but all I got is NA's. In my original data, I have observations where no changes occurs. I am curious if it is related to that. – convex895 Nov 22 '20 at 08:15
  • @convex895, so if there is no change for a particular companyid, what value do you need the new "changeyear" to have or you don't need that companyid in the final output? – Karthik S Nov 22 '20 at 08:38
  • @convex895, if you don't need to include companyid's wherein there's no change, you can simply use %>% na.omit() at the end to my code, it will do the job. – Karthik S Nov 22 '20 at 08:53
  • NA can stay for the case when there is no change or we can also omit them as you said but my problem is "changeyear" is somehow full of NA's. – convex895 Nov 22 '20 at 17:03
  • 1
    @convex895, not sure why, the code works for the sample data you shared. Is the sample data you shared a subset of your actual data or representative of your actual data? – Karthik S Nov 22 '20 at 17:48