I'm writing a script that anonymizes participant data from a file.

Basically, I have:

  • A folder of plaintext participant data (sometimes CSV, sometimes XML, sometimes TXT)
  • A file of known usernames and accompanying anonymous IDs (e.g. jsmith1 as a known username, User123 as an anonymous ID)

I want to replace every instance of the known username with the corresponding anonymous ID.

Generally speaking, what I have works just fine -- it loads in the usernames and anonymous IDs into a dictionary and one by one runs a find-and-replace on the document text for each.

However, this script also strips out names, and it runs into some difficulty when it encounters names contained in other names. So, for example, I have two pairs:

John,User123
Johnny,User456

Now, when I run the find-and-replace, it may encounter John first. It then replaces Johnny with User123ny, so the Johnny pair never matches.

The simplest solution I can think of is just to run the find-and-replace from longest key to shortest. To do that, it looks like I need a SortedDictionary.

However, I can't seem to convince Visual Basic to take my custom Comparer for this. How do you specify this? What I have is:

Sub Main()
    Dim nameDict As New SortedDictionary(Of String, String)(AddressOf SortKeyByLength)
End Sub

Public Function SortKeyByLength(key1 As String, key2 As String) As Integer
    If key1.Length > key2.Length Then
        Return 1
    ElseIf key1.Length < key2.Length Then
        Return -1
    Else
        Return 0
    End If
End Function

(The full details above are in case anyone has any better ideas for how to resolve this problem in general.)

David
  • is there some reason you're not looking for exact matches of the username in the first place? i.e. why should it replace Johnny with John's anonymous id? – monty Oct 07 '15 at 05:03
  • That's what I'm trying to do -- but 'John' is contained within 'Johnny', and this is a straight string comparison. – David Oct 07 '15 at 17:19
  • Sure, but if it's a straight string comparison then 'John' <> 'Johnny'. Can you please show where you are doing your string comparison? This might save you all of this SortedDictionary malarkey. – monty Oct 07 '15 at 20:35
  • Oh, I see -- I'm using String.Replace. The reason is that I'm doing these find-and-replaces in plaintext log files of forum interactions, among other things, so I'm not identifying in advance when I'm comparing a single word. For example, the text might be, "Well John Smith said that he wants to do this..." -- I need to replace 'John' and 'Smith' to anonymize the post transcript. – David Oct 07 '15 at 22:38
  • But if you searched for " John " then it wouldn't pick up on "Johnny" for the replace. Additionally, you may want to allow a full stop at the end, so " John." should be replaced with " ." as well. – monty Oct 07 '15 at 23:10
  • True, but it also wouldn't pick up "<newline>John" or "John<newline>" in plaintext, or ",John" or "John," in CSV files. – David Oct 12 '15 at 19:26
  • So, are you doing this in a variety of files: csv, html, etc.? You might want a regex to take care of all the edge cases... basically any character on either side of your name that is not in the alphabet, i.e. [^A-Za-z]John[^A-Za-z] (then you'd just need to take care of a John at the beginning of your string, and at the end). – monty Oct 12 '15 at 22:56
  • What benefit would that grant over simply going from longest to shortest? – David Oct 14 '15 at 00:54
  • 1. clearer code as to what you're actually doing. 2. you'd have to do speed comparisons to know which was faster so I wouldn't make any claims here other than to say, it'd depend on your data. – monty Oct 14 '15 at 02:35
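
A minimal sketch of the regex idea raised in the comments above, in case it helps anyone comparing the two approaches. The module and function names (WholeWordDemo, ReplaceWholeName) are made up for illustration, and it assumes names consist of ASCII letters only. It uses lookarounds instead of [^A-Za-z] on each side, so the neighbouring characters are not consumed and a name at the very start or end of the text still matches:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WholeWordDemo
    ' Hypothetical helper: replaces name only where it is not directly
    ' preceded or followed by an ASCII letter (assumption: names are ASCII).
    Public Function ReplaceWholeName(text As String, name As String, anonId As String) As String
        ' Regex.Escape guards against names that contain regex metacharacters.
        Dim pattern As String = "(?<![A-Za-z])" & Regex.Escape(name) & "(?![A-Za-z])"
        ' A lambda is used so that a "$" in the ID is not read as a substitution.
        Return Regex.Replace(text, pattern, Function(m As Match) anonId)
    End Function

    Sub Main()
        ' "Johnny" is left alone because "John" is followed by a letter there.
        Console.WriteLine(ReplaceWholeName("John,Johnny;John.", "John", "User123"))
        ' Prints: User123,Johnny;User123.
    End Sub
End Module
```

With whole-word matching the order of replacements stops mattering for the John/Johnny case, although it would not help if one full name legitimately appears inside another as a separate word.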

1 Answer

I think it takes a class that implements the IComparer interface, so you'd want something like:

Public Class ByLengthComparer
    Implements IComparer(Of String)

    Public Function Compare(key1 As String, key2 As String) As Integer Implements IComparer(Of String).Compare
        If key1.Length > key2.Length Then
            Return 1
        ElseIf key1.Length < key2.Length Then
            Return -1
        Else
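            '[edit: in response to comments below]
            'Returning 0 here would make equal-length keys count as duplicates,
            'so fall back to an ordinal comparison of the keys themselves.
            Return String.CompareOrdinal(key1, key2)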
        End If
    End Function
End Class

Then, inside your main method, you'd call it like this:

Dim nameDict As New SortedDictionary(Of String, String)(New ByLengthComparer())  

You might want to take a look (or a relook) at the documentation for the SortedDictionary constructor, and how to make a class that implements IComparer.
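
If you are on .NET Framework 4.5 or later, Comparer(Of String).Create should also let you keep a plain function and pass it with AddressOf, which is close to what the question originally tried. Below is a rough sketch under that assumption. LongestFirstDemo, LongestFirst and Anonymize are illustrative names, not anything from the original code. Note that this version sorts longer keys first, so a plain For Each already visits them from longest to shortest, and it breaks ties ordinally so that equal-length keys are not rejected as duplicates:

```vbnet
Imports System
Imports System.Collections.Generic

Module LongestFirstDemo
    ' Hypothetical comparison: longer keys sort first; equal-length keys
    ' fall back to an ordinal comparison so they stay distinct.
    Public Function LongestFirst(key1 As String, key2 As String) As Integer
        Dim byLength As Integer = key2.Length.CompareTo(key1.Length)
        If byLength <> 0 Then Return byLength
        Return String.CompareOrdinal(key1, key2)
    End Function

    ' Applies each replacement in the dictionary's sort order (longest key first).
    Public Function Anonymize(text As String, names As SortedDictionary(Of String, String)) As String
        For Each pair As KeyValuePair(Of String, String) In names
            text = text.Replace(pair.Key, pair.Value)
        Next
        Return text
    End Function

    Sub Main()
        ' Comparer(Of T).Create wraps the function in an IComparer(Of String).
        Dim nameDict As New SortedDictionary(Of String, String)(
            Comparer(Of String).Create(AddressOf LongestFirst))
        nameDict.Add("John", "User123")
        nameDict.Add("Johnny", "User456")
        nameDict.Add("Jane", "User789")

        Console.WriteLine(Anonymize("Johnny and John met Jane.", nameDict))
        ' Prints: User456 and User123 met User789.
    End Sub
End Module
```

One thing this sketch does not handle: if an anonymous ID happens to contain one of the names, a later replacement would rewrite it, so the IDs are assumed not to overlap with any key.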

monty
  • Thank you! You're right, I was getting this mixed up with how to specify a sorting method on an array. However, one weird thing. When I do this, it seems to treat any two keys with the same length as duplicates of one another. Any idea how to get around that? – David Oct 07 '15 at 17:24
  • In the hanging else you'd want to add the usual string compare then. Do you want that to be case-sensitive or case-insensitive? – monty Oct 07 '15 at 20:35
  • Oh I see -- so basically if the comparer identifies them as equal, the dictionary identifies them as equal. Case doesn't really matter to me -- the function I have in mind requires only that it go longest to shortest. Thank you! – David Oct 07 '15 at 22:36