I have a data set from a state agency and am trying to clean it up. One obstacle is that there are no input standards for titles (e.g., DIR, DIRECTOR, DIR., are all allowable inputs). Another obstacle is that an individual may have several job titles, but only 1 of them is entered.
For example, consider employee #1 below (Emp_1). This person is a faculty member. They teach year after year. But in 2015, in addition to teaching duties, they also picked up some work that would be classified as non-faculty, or exempt. But their ‘real’ classification is faculty. In this example, Emp_1 has been given two records. I’ve done a quick grep to create a new classification titled “job.cat” that helps identify faculty. But in the case of Emp_1 and Emp_4 below, you can see that they get miscategorized as ‘non-faculty’. Now, I could just do a grep for “STIPEND/COORD FAC,EXMT EMP” and count that as faculty too, but there are some people who never teach and are always “STIPEND/COORD FAC,EXMT EMP”, and so should be counted as non-faculty.
To solve this, I’m thinking I should create a second data set of names, job titles, and job categories, clean that up, and then merge it back to this data below where I’d join using employee_name and the correct job. cat. But I’m wondering if there is a better way to do this using an apply like function, an ifelse, or a for-loop.
d <- read.table(text = 'employee_name job_title Salary_2012 Salary_2013 Salary_2014 Salary_2015 job.cat
Emp_1 FACULTY 31200 37400 33300 NA Faculty
Emp_1 "STIPEND/COORD FAC,EXMT EMP" NA NA NA 37300 Non-fac
Emp_2 FACULTY 29300 28400 31800 NA Faculty
Emp_2 "PART TIME FACULTY" NA NA NA 30800 Faculty
Emp_3 FACULTY NA NA 4300 NA Faculty
Emp_4 FACULTY 50000 59900 31300 NA Faculty
Emp_4 "STIPEND/COORD FAC,EXMT EMP" NA NA NA 22000 Non-fac', header = TRUE)