Managing duplicates that are not entered as duplicates in R

Question

I have a data set from a state agency and am trying to clean it up. One obstacle is that there are no input standards for titles (e.g., DIR, DIRECTOR, DIR., are all allowable inputs). Another obstacle is that an individual may have several job titles, but only 1 of them is entered.

For example, consider employee #1 below (Emp_1). This person is a faculty member. They teach year after year. But in 2015, in addition to teaching duties, they also picked up some work that would be classified as non-faculty, or exempt. But their ‘real’ classification is faculty. In this example, Emp_1 has been given two records. I’ve done a quick grep to create a new classification titled “job.cat” that helps identify faculty. But in the case of Emp_1 and Emp_4 below, you can see that they get miscategorized as ‘non-faculty’. Now, I could just do a grep for “STIPEND/COORD FAC,EXMT EMP” and count that as faculty too, but there are some people who never teach and are always “STIPEND/COORD FAC,EXMT EMP”, and so should be counted as non-faculty.

To solve this, I’m thinking I should create a second data set of names, job titles, and job categories, clean that up, and then merge it back to this data below where I’d join using employee_name and the correct job. cat. But I’m wondering if there is a better way to do this using an apply like function, an ifelse, or a for-loop.

d <- read.table(text = 'employee_name   job_title   Salary_2012 Salary_2013 Salary_2014 Salary_2015 job.cat
Emp_1   FACULTY 31200   37400   33300   NA  Faculty
Emp_1   "STIPEND/COORD FAC,EXMT EMP"  NA  NA  NA  37300   Non-fac
Emp_2   FACULTY 29300   28400   31800   NA Faculty
Emp_2   "PART TIME FACULTY"   NA  NA  NA  30800   Faculty
Emp_3   FACULTY NA  NA  4300    NA  Faculty
Emp_4   FACULTY 50000   59900   31300   NA  Faculty
Emp_4   "STIPEND/COORD FAC,EXMT EMP"  NA  NA  NA  22000   Non-fac', header = TRUE)

Thanks for that @BenBolker appropriate for another project I'm working on right now that whose data are the 10th circle of hell. — Brandon Bertelsen, Nov 23 '16 at 19:58

Brandon Bertelsen · Accepted Answer · 2016-11-23T20:02:28.563

0

I think you can approach this by grouping results by employee_name, checking if that employee has any faculty job, then defining them as faculty:

library(dplyr)
d %>% 
  group_by(employee_name) %>% 
  mutate(job.cat = ifelse(any(grep("FACULTY",job_title)),"Faculty", "Non-fac"))

Your other problem could have a similar solution:

d %>% 
  group_by(employee_name) %>% 
  mutate(job_title = ifelse(any(grep("DIR",job_title)),"Director", job_title))

edited Nov 23 '16 at 20:02

answered Nov 23 '16 at 19:56

Brandon Bertelsen

43,807
34
160
255

Could you please include suggestion to address cases like these `DIR, DIRECTOR, DIR.` i.e. `^DIR.*` – Silence Dogood Nov 23 '16 at 20:02
Updated, you can see that it's a very similar approach. Essentially we're backfilling with the most important piece of information. – Brandon Bertelsen Nov 23 '16 at 20:05
Thanks, guess user has to look for unique values of `job_title` and create corresponding mapping as outlined in your solution to ensure all cases/possiblities are addressed – Silence Dogood Nov 23 '16 at 20:10
1

`grep` takes a regex argument. So if this is a large data set with many options you would want to make sure it's capturing what you expect with checks like: `grep("DIR", d$job_title) %>% unique()` and then refine your regex if necessary to narrow the selection. – Brandon Bertelsen Nov 23 '16 at 20:15
Thanks all for your comments regading regex on the titles. There are around 1000 employees over about 10 years of data, and even more job titles. The 'director' is one of the more trivial examples For instance, we have job titles such as "chancellor" "vice chancellor" "v. chan" "executive sec to chan", etc. I've got several regex expressions that have tried to go through many of these situations. But each year when a new batch of data is given to the state, a whole host of new naming conventions are given and I have to re-write the greps. – Brian Holt Nov 23 '16 at 22:38
--Brandon, thank you for that first code. Using 'any' is new to me and is the logic I was looking for. Thank you. I'm going to give it a go and see what happens before I consider this solved. Cheers – Brian Holt Nov 23 '16 at 22:40
1

First look at a sorted vector of unique job titles, `d$job_title %>% unique %>% sort` then pull out the ones that are equal by your understanding and use regex `grep("vice chancellor|^v[.]*chan$", d$job_title)` in the example that I've given you. – Brandon Bertelsen Nov 23 '16 at 23:48

Managing duplicates that are not entered as duplicates in R

1 Answers1