I have a dataset in Stata with a string variable called "program description". Many of its values are near-duplicates of one another, but they don't follow any consistent pattern. My objective is to clean the variable so that values which mean the same thing share a single name.
Here is an example of what the variable looks like:
Variable Name
phys ed
physical education
phys ed k-12
learning disabilities
learn dis
learn disable
I would like the first three to all be called "phys ed" (or some derivative of that) and the last three to all be called "learning disabilities".
I've been using the strpos() function to replace observations that contain certain phrases, but because the variable has 100k observations and many distinct values, this takes a while.
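For reference, here is a minimal sketch of the kind of thing I am doing now. The variable names program_description and program_clean and the search phrases "phys" and "learn" are placeholders for illustration; my real data needs many more rules like these.

```stata
* Toy data reproducing the example values above
clear
input str30 program_description
"phys ed"
"physical education"
"phys ed k-12"
"learning disabilities"
"learn dis"
"learn disable"
end

* Work on a lowercase, trimmed copy so the original values are preserved
* (lower() and trim() are the older names for these functions)
generate str30 program_clean = strtrim(strlower(program_description))

* One replace per phrase: strpos() returns 0 when the phrase is absent
replace program_clean = "phys ed" if strpos(program_clean, "phys") > 0
replace program_clean = "learning disabilities" if strpos(program_clean, "learn") > 0

* Check which distinct values remain
tabulate program_clean
```

Each rule is a separate replace line that I have to write by hand after looking at the data, which is the part that does not scale.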