I wrote the following, and it runs without errors.

df2$qualifications <- as.numeric(grepl("high school|Bachelor|master|phd",df2$description,ignore.case=TRUE))
df2$qualifications

This is the output, which shows 1 if any of the words above is mentioned and 0 otherwise.

  [1] 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0
 [51] 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 1
[101] 0 1 0 0

This is a dataset of job postings, each with the educational qualifications the employer is looking for. I want to assign a dummy variable that encodes the educational level mentioned in each job's description.

Specifically, I am looking for output like the one below, where 0 means no qualification is mentioned, 1 = high school, 2 = Bachelor, 3 = master's, and 4 = PhD:

[1] 0 2 4 1 3 1 0 1 0 1 1 1 2 1 0 1
maldini425
    Check out the `mapvalues` function from the `plyr` package. – Baraliuh Mar 23 '21 at 18:30
    Can you post sample data? Please edit **the question** with the output of `dput(df2$qualifications)` or, if it is too big, with the output of `dput(head(df2$qualifications, 20))`. – Rui Barradas Mar 23 '21 at 18:32

3 Answers

Using for-loops:

# Place the possible levels into a vector, ordered from lowest to highest
educ = c("high school", "Bachelor", "master", "phd")

# Example data, built after educ is defined
df2 = data.frame(description = sample(educ, 100, TRUE))
df2$qualifications = 0 # new column; 0 means no qualification is mentioned

# For each level in educ, give matching descriptions that level's code (1 to 4).
# Later levels overwrite earlier ones, so the highest level mentioned wins.
for(i in educ){
  value = grepl(i, df2$description, ignore.case=TRUE)
  df2$qualifications[value] = which(educ == i)
}
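
Because later iterations of the loop overwrite earlier ones, a description that mentions several levels ends up with the highest code. A vectorized sketch of the same rule in base R follows; the sample descriptions are hypothetical, since the question does not show the real data.

```r
# Hypothetical sample data; the real df2$description is not shown in the question
df2 <- data.frame(
  description = c("High school diploma required",
                  "Bachelor or Master degree in statistics",
                  "PhD preferred",
                  "No formal requirements"),
  stringsAsFactors = FALSE
)

# Levels ordered from lowest to highest; the position is the code
educ <- c("high school", "bachelor", "master", "phd")

# Logical matrix: one row per description, one column per level
hits <- sapply(educ, grepl, x = df2$description, ignore.case = TRUE)

# Multiply each column by its level code, then take the row-wise maximum;
# rows without any match stay at 0
codes <- sweep(hits, 2, seq_along(educ), `*`)
df2$qualifications <- apply(codes, 1, max)

df2$qualifications
```

The second description mentions both Bachelor and Master and gets the higher code, 3.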

As you're already creating a categorical variable, I'd recommend using the

You can also do this with case_when from dplyr:

library(dplyr)

# Conditions are checked in order; the first one that matches sets the value
df2 %>% 
  dplyr::mutate(qualifications = case_when(
    grepl("high school", description, ignore.case = TRUE) ~ 1,
    grepl("Bachelor", description, ignore.case = TRUE) ~ 2,
    grepl("master", description, ignore.case = TRUE) ~ 3,
    grepl("phd", description, ignore.case = TRUE) ~ 4,
    TRUE ~ 0 # no qualification mentioned
  ))
LMc
  • `library(dplyr)` makes `dplyr::` redundant. – rjen Mar 23 '21 at 18:51
  • In this case yes, but not if the function is masked by another package. I also like to use the prefix in SO answers so readers know where each function comes from. – LMc Mar 23 '21 at 19:15
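
One detail worth knowing about `case_when`: it uses the first condition that is TRUE for each row, so with the order above a posting that mentions both "Bachelor" and "master" is coded 2. If the highest level should win instead, one option is to list the conditions from highest to lowest. This is only a sketch on hypothetical sample descriptions, not the asker's data.

```r
library(dplyr)

# Hypothetical sample data; the real descriptions are not shown in the question
df2 <- data.frame(
  description = c("High school diploma required",
                  "Bachelor or Master degree in statistics",
                  "PhD preferred",
                  "No formal requirements"),
  stringsAsFactors = FALSE
)

# Highest level first, so it takes precedence when several are mentioned
df2 <- df2 %>%
  mutate(qualifications = case_when(
    grepl("phd", description, ignore.case = TRUE) ~ 4,
    grepl("master", description, ignore.case = TRUE) ~ 3,
    grepl("bachelor", description, ignore.case = TRUE) ~ 2,
    grepl("high school", description, ignore.case = TRUE) ~ 1,
    TRUE ~ 0 # no qualification mentioned
  ))

df2$qualifications
```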
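Using plyr's mapvalues function:

library(dplyr) # provides %>% and mutate()

# Example data: each value is exactly one of the five level names
tibble::tibble(
  dummy_data = sample(c('no qual', 'high school', 'Bachelor', 'master', 'phd'), 20, replace = TRUE)
) %>% 
  mutate(
    # Map each level name to its code; the result is still character here
    dummy_variable = plyr::mapvalues(dummy_data, c('no qual', 'high school', 'Bachelor', 'master', 'phd'), 0:4),
    dummy_variable = as.integer(dummy_variable)
  )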

Output:

# A tibble: 20 x 2
   dummy_data  dummy_variable
   <chr>                <int>
 1 no qual                  0
 2 phd                      4
 3 phd                      4
 4 high school              1
 5 no qual                  0
 6 phd                      4
 7 no qual                  0
 8 no qual                  0
 9 no qual                  0
10 no qual                  0
11 master                   3
12 phd                      4
13 high school              1
14 no qual                  0
15 Bachelor                 2
16 high school              1
17 high school              1
18 phd                      4
19 phd                      4
20 phd                      4
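
Note that `mapvalues` matches whole values, so this approach fits a column that already holds exactly one level name per row, not free-text descriptions. For that case, a named lookup vector in base R is a minimal alternative sketch; `level_codes` and the sample vector are made-up names for illustration.

```r
# Hypothetical lookup: level name -> integer code
level_codes <- c("no qual" = 0L, "high school" = 1L, "Bachelor" = 2L,
                 "master" = 3L, "phd" = 4L)

# Hypothetical sample column holding exactly one level name per entry
dummy_data <- c("phd", "no qual", "Bachelor", "high school", "master")

# Index the lookup by name, then drop the names from the result
dummy_variable <- unname(level_codes[dummy_data])

dummy_variable
```

Values that are not in the lookup come back as NA, which makes unexpected entries easy to spot.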
Baraliuh