Complex survey data matching and cleaning

Question

I'm trying to match some super complex personal identifiers across ~6000 survey responses, currently stored in a big Excel doc.

In essence, respondents participated at time 1 and 2. They were each required to input their government identifier at both time points: either a "CPF" if an adult, or an "RA" if a child. They were additionally requested, but not required, to provide a phone number to serve as a backup ID.

So, theoretically I just need to find the matching pairs of CPF or RA to link the responses! Then, I could assign these pairs a NEW unique identifier, and code them as T1 or T2, for the different time points.

However, the inputs and character types are wildly varying. I'm struggling, conceptually, to figure out how I'd write a code -- Stata, R -- or use Excel formulas to match this mess.

Could someone please help me with this mind-bending process of 1) matching response pairs using varying ID formats (RA or CPF, with cell number as a backup if neither work), 2) creating a new identifier for each pair, and 3) making a T1 vs T2 variable for each pair?

The data is:

RA -- Student Identifier

A 9-digit personal student identifier. Can sometimes include a “-“ or “.” or “ “ preceding the final digit. Can also sometimes include a two-letter identifier at the end of the string (often “SP”, sometimes preceded by a space or hyphen). Can also sometimes include a 000 or 111 preceding the string. E.g.000123456789-X, 113488865x, 63530748, 000826751-5, 11145950310, 000112957300-x So: 000123456789-x = 000123456789x = 000123456789 = 0123456789 x = 123456789-x = 123456789 etc

CPF –- Adult identifier

An 11-digit personal government identifier. Can sometimes include “.” or “-“ or “ “ but each of these characters is irrelevant. May at times have an incorrect OR missing character preceding or ending the string (e.g. 4520781485 (missing one) or 223826338162 (one additional)), which needs to be removed to reattempt a match E.g.213.598.268-74, 9655287823, 22382633816, 4520781485 So: 111.222.333-44 = 11122233344 = 111222333 = 0111.222.33344 etc

Cell number – Backup (cellphone) identifier

A BACKUP identifier, 9-digit, should the previous identifiers not provide a match. The respondent’s phone number, with a two-number prefix (“DDD”, should be “11” in the region if they are local). However, may have spaces, “.”, “-“ or other confounding characters, such as “+” – particularly a “+55” which is the Brazil national code. So: 11 222 333 444 = +55 11 222 333 444 = 5511 222 333 444 = 222.333.444 = 11- 222-333-444 etc

I tried to use a simple MATCH function in Excel, which returned almost no matches. I had no idea where to get started in a coding language!

So the problem is that the data might be stored by any of those format and you need a way to find any match between say an RA written like 000123456789-X and the "same one" but written like 123456789x or 111123456789X; is that correct? — Evil Blue Monkey, Mar 13 '23 at 21:01
Exactly! Trying to find matches between pairs of RA or CPF entries, which may be written in different ways. Precisely as you'd noted, 000123456789-X could match 123456789x or 111123456789X etc. — user18103480, Mar 13 '23 at 21:27
It sounds like you need to start by cleaning your data. Get rid of the trailing 000s and 111s, get rid of the '-', '+', and '.' separators and the trailing spaces. Once you’ve cleaned the data, I suggest you try a fuzzy matching algorithm. I would use `dplyr` and `stringr` for the data cleaning and `fuzzyjoin::stringdist_join` to do a fuzzy match if you want to do this in R :) — mfg3z0, Mar 13 '23 at 22:29
+1 @mfg3z0 I have spent over 25 years dealing with 'messy' data. In my experience, it is ALWAYS faster to clean your data prior to matching rather than trying to create a method to deal with 'messy' data. I generally maintain the original unique identifier fields and then create new 'clean' ID fields for the matching. That way, you can keep trace everything back to the original data if necessary — L Tyrone, Mar 13 '23 at 22:55
Please provide enough code so others can better understand or reproduce the problem. — Community, Mar 14 '23 at 11:04

score 0 · Answer 1 · answered Mar 14 '23 at 16:46

A possible solution might include clearing the data (as suggested by mfg3z0 and Leroy Tyrone in the comments), maybe via a code similar to this:

Sub SubClearData()
    
    Dim RngSource As Range
    Dim VarOutput As Variant
    Dim DblRow As Double
    Dim DblColumn As Double
    Dim StrValue As String
    Dim DblCounter01 As Double
    Dim StrDataType As String
    
    On Error Resume Next
CP_RngSource_Setting:
    Set RngSource = Application.InputBox("Select the range to edit:", "Range to edit", , , , , , 8)
    If RngSource Is Nothing Then Exit Sub
    If RngSource.Areas.Count > 1 Then
        MsgBox "Select a range with a single area", vbCritical + vbOKOnly, "Incorrect input"
        GoTo CP_RngSource_Setting
    End If
    On Error GoTo 0
    
    VarOutput = RngSource.Value2
    
    StrDataType = UCase(InputBox("Specify type of data (RA, CPF, CN):", "Type of data"))
    
    Select Case StrDataType
        Case Is = "RA"
            
        Case Is = "CPF"
            
        Case Is = "CN"
            
        Case Else
            Exit Sub
    End Select
    
    For DblRow = 1 To UBound(VarOutput, 1)
        For DblColumn = 1 To UBound(VarOutput, 2)
            
            StrValue = VarOutput(DblRow, DblColumn)
            
            For DblCounter01 = 1 To Len(StrValue)
                If Not IsNumeric(Mid(StrValue, DblCounter01, 1)) Then
                    StrValue = Replace(StrValue, Mid(StrValue, DblCounter01, 1), " ")
                End If
            Next
            
            StrValue = Replace(StrValue, " ", "")
            
            Select Case StrDataType
                Case Is = "RA"
                    
                    If Len(StrValue) = 12 And (Left(StrValue, 3) = "000" Or Left(StrValue, 3) = "111") Then
                        StrValue = Right(StrValue, 9)
                    End If
                    
                Case Is = "CPF"
                    
                Case Is = "CN"
                    
                    StrValue = Right(StrValue, 9)
                    
            End Select
            
            VarOutput(DblRow, DblColumn) = StrValue
            
        Next
    Next
    
    RngSource.Value2 = VarOutput
    
    
End Sub

You should copy the data (or a portion of it) to a new sheet and then run the macro. It will ask you to specify the range that hosts the data you need to be cleaned and the type of data. Properly answer and check if the result satisfies you. After you've cleaned the data, you can put them into a filtered list and use the filter to search for the identifier of your choice. The filter allows you to specify part of the identifier which will be searched inside the cells. This way you can obtain a list of all the data with a match (full or partial) to the identifier of your choice.

This is a first draft solution. Better ones will most likely need more informations to be created.

Complex survey data matching and cleaning

1 Answers1