I'm trying to match some super complex personal identifiers across ~6000 survey responses, currently stored in a big Excel doc.
In essence, respondents participated at time 1 and 2. They were each required to input their government identifier at both time points: either a "CPF" if an adult, or an "RA" if a child. They were additionally requested, but not required, to provide a phone number to serve as a backup ID.
So, theoretically I just need to find the matching pairs of CPF or RA to link the responses! Then, I could assign these pairs a NEW unique identifier, and code them as T1 or T2, for the different time points.
However, the inputs and character types are wildly varying. I'm struggling, conceptually, to figure out how I'd write a code -- Stata, R -- or use Excel formulas to match this mess.
Could someone please help me with this mind-bending process of 1) matching response pairs using varying ID formats (RA or CPF, with cell number as a backup if neither work), 2) creating a new identifier for each pair, and 3) making a T1 vs T2 variable for each pair?
The data is:
- RA -- Student Identifier
- A 9-digit personal student identifier. Can sometimes include a “-“ or “.” or “ “ preceding the final digit. Can also sometimes include a two-letter identifier at the end of the string (often “SP”, sometimes preceded by a space or hyphen). Can also sometimes include a 000 or 111 preceding the string. E.g.000123456789-X, 113488865x, 63530748, 000826751-5, 11145950310, 000112957300-x So: 000123456789-x = 000123456789x = 000123456789 = 0123456789 x = 123456789-x = 123456789 etc
- CPF –- Adult identifier
- An 11-digit personal government identifier. Can sometimes include “.” or “-“ or “ “ but each of these characters is irrelevant. May at times have an incorrect OR missing character preceding or ending the string (e.g. 4520781485 (missing one) or 223826338162 (one additional)), which needs to be removed to reattempt a match E.g.213.598.268-74, 9655287823, 22382633816, 4520781485 So: 111.222.333-44 = 11122233344 = 111222333 = 0111.222.33344 etc
- Cell number – Backup (cellphone) identifier
- A BACKUP identifier, 9-digit, should the previous identifiers not provide a match. The respondent’s phone number, with a two-number prefix (“DDD”, should be “11” in the region if they are local). However, may have spaces, “.”, “-“ or other confounding characters, such as “+” – particularly a “+55” which is the Brazil national code. So: 11 222 333 444 = +55 11 222 333 444 = 5511 222 333 444 = 222.333.444 = 11- 222-333-444 etc
I tried to use a simple MATCH function in Excel, which returned almost no matches. I had no idea where to get started in a coding language!