Cleaning up variable with highly similar observations

Question

So I have a dataset in Stata that has a variable called "program description" that has very similar observations although the observations don't follow any pattern. My objective is to clean the variable so that the observations which are very similar will have the same name.

Here is an example of what the variable looks like:

Variable Name

phys ed
physical education
phys ed k-12
learning disabilities
learn dis
learn disable

Therefore, I would like the first three to just be called "phys ed" (or some derivative of that) and the last three to just be called "learning disabilities"

I've been using the function strpos() to replace observations that contain certain phrases but because the variable has 100k observations and a lot of different names, this takes a while.

This has come up several time at www.statalist.com. I would search for answers provided there, as quite a few solutions have been proposed. Unfortunately, I can't provide any links at the moment. — Roberto Ferrer, Nov 05 '14 at 22:11
i've looked everywhere for information and haven't found anything useful so if you find any links, let me know. — Pcarlitz, Nov 06 '14 at 03:16
http://www.stata.com/statalist/archive/2013-11/msg01209.html and http://www.stata.com/statalist/archive/2012-03/msg01135.html and http://www.stata.com/statalist/archive/2004-02/msg00246.html and http://www.statalist.org/forums/forum/general-stata-discussion/general/754-repeated-names-in-a-string-variable-but-some-have-typos-how-to-correct and more generally: https://www.google.com/search?q=statalist+%2B+string+match&gws_rd=ssl#q=statalist+%2B+fuzzy+string+ — Roberto Ferrer, Nov 06 '14 at 03:22
What you are listing as variable **names** and what you are saying a variable is **called** are presumably the **values** of a single variable. A observation in Stata is an entire record or case or row (in spreadsheet terms), not a single value of a single variable. — Nick Cox, Nov 06 '14 at 08:17
Regular expressions are one answer. This kind of problem is common but solutions are difficult to automate fully because fine distinctions demand decisions pm semantics, not literal contents (e.g. being able to recognise similarities despite different spelling or different wording). — Nick Cox, Nov 06 '14 at 08:19

dimitriy · Accepted Answer · 2014-11-06T01:55:48.140

You can use strgroup from SSC, but it's unlikely to get you all the way there. For example, this seems to work:

. strgroup string , gen(group) threshold(.7) normalize(longer)

. list, clean noobs

                   string   group  
                  phys ed       1  
       physical education       1  
             phys ed k-12       1  
    learning disabilities       2  
                learn dis       2  
            learn disable       2

However, "physics" would have been mapped to group 1 with these settings. Also, note that this command is case sensitive, so it might make sense to uppercase/lowercase everything first. The threshold is really a kind of tuning parameter.

I've also had some luck with Google/Open Refine with these problems. This is called reconciliation.

With all these approaches, some standardization goes a long way.

I've been searching for so long to find a command like this...thank you! — Pcarlitz, Nov 06 '14 at 03:12

Cleaning up variable with highly similar observations

1 Answers1