I'm trying to populate a new EMR with data from an existing environment. I am pulling a log of all activity for a given interface and feeding it into the inbound channel in the new environment. The problem is that our existing channel log has duplicates of some messages, which will create duplicate reports in the patient records.

Beyond searching what feels like the entire internet, I've tried pushing text around in Iguana, PowerShell, and Excel, and I'm not familiar enough with Mirth Connect to make use of it. I'm not married to any one solution; I just need a solution, and PDQ.

I found a fairly good starting point at https://www.secretgeek.net/ps_duplicates and have been massaging it, but I still don't have a complete solution. At this point I've basically reset it to zero because nothing I've done has improved it (mostly I broke it, repeatedly).

$hash = @{}                                  # hashtable of lines already seen
Get-Content "c:\Samples\Q12019.txt" |        # send the content of the file into the pipeline
  ForEach-Object {
    if ($null -eq $hash[$_]) {               # if this line isn't a key in the hashtable yet
      $_                                     # pass the line down the pipe
    }
    $hash[$_] = 1                            # record the line so it isn't sent again
  } > "c:\Samples\RadHx Test Q12019.txt"

This does some trippy stuff I don't understand. It ingests the file, but the output has a new space B E T W E E N every single character. I can't even tell whether it's removing duplicates, and I haven't been able to get it to stop doing this. I'm also not sure it's reading an entire message, including all of its segments. Example 2 at https://healthstandards.com/blog/2007/09/10/variations-of-the-hl7-orur01-message-format/ is close enough to what I'm dealing with as an example of the input; just imagine 2,000 more of those in one text file.

Simplified explanation: I have a text file with several blocks of related text. Each block starts with the same sequence of characters, say 'ABC'. The blocks have arbitrary lengths and don't necessarily end with the same string, but every block ends with CRLF. Problem: blocks may repeat, and I need to eliminate the repeats so the file contains only one instance of each block.
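
Here's roughly the block-level version I've been aiming for (an untested sketch: it assumes every block starts with a line beginning 'MSH' as in my real data, that the file is Unicode, and it uses my sample file paths):

$raw = Get-Content "c:\Samples\Q12019.txt" -Raw -Encoding Unicode

# Split at every line that starts with MSH; the lookahead keeps the MSH line
# attached to its own block.
$blocks = $raw -split '(?m)(?=^MSH)' | Where-Object { $_.Trim() }

$seen = @{}
$unique = foreach ($block in $blocks) {
    if (-not $seen.ContainsKey($block)) {    # only pass blocks not seen before
        $seen[$block] = $true
        $block
    }
}

($unique -join '') | Set-Content "c:\Samples\RadHx Test Q12019.txt" -Encoding Unicode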

Gryyphyn
  • I said Excel because it's one place where this issue could be resolved since I'm dealing with it as a text file and not directly in an interface. – Gryyphyn Jun 25 '19 at 19:59
  • Maybe `gc "c:\Samples\Q12019.txt" -Encoding Unicode` could help? Maybe handling spaces and `␋` characters in input as `$hash."$_"` could help as well? – JosefZ Jun 25 '19 at 22:43
  • @JosefZ Thank you. The encoding flag corrected the extra spaces and improved the processing time significantly. Still getting the duplicates but I think that's because it's treating each line as a unique string when it processes the text stream. – Gryyphyn Jun 26 '19 at 20:14

1 Answer


Mirth should be able to easily debatch the file for you. If the messages are exact duplicates, you can probably just keep track as you go of a few of the MSH fields that should guarantee uniqueness.
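
If you'd rather stay in PowerShell, the same idea looks roughly like this. A sketch only: which MSH fields are actually unique in your feed is an assumption, and $blocks is the per-message split from the sketch in your question.

# After splitting the MSH segment on '|', MSH-1 is the separator itself, so
# MSH-7 (message datetime) lands at index 6 and MSH-10 (Message Control ID)
# at index 9.
function Get-MshKey([string]$msg) {
    $msh = ($msg -split '\r?\n')[0] -split '\|'
    '{0}|{1}' -f $msh[6], $msh[9]
}

$seen = @{}
$unique = foreach ($msg in $blocks) {
    $key = Get-MshKey $msg
    if (-not $seen.ContainsKey($key)) {      # first time we've seen this key
        $seen[$key] = $true
        $msg
    }
}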

If they are resends of the same data, where the message is mostly the same but some fields (especially in the MSH segment) may be updated, you'll probably want to exclude those segments, hash the rest of the message, and track that instead (maybe along with a patient ID or similar, for the rare case of a hash collision).
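
A minimal sketch of that variant, again in PowerShell rather than Mirth; dropping the entire MSH segment and using SHA256 are my assumptions, not requirements:

# Hash each message with the MSH segment removed, so resends whose MSH
# fields were updated still collapse to a single copy.
function Get-MessageHash([string]$msg) {
    $body  = ($msg -split '\r?\n' | Where-Object { $_ -notmatch '^MSH' }) -join "`n"
    $sha   = [System.Security.Cryptography.SHA256]::Create()
    $bytes = [System.Text.Encoding]::UTF8.GetBytes($body)
    [System.BitConverter]::ToString($sha.ComputeHash($bytes)) -replace '-', ''
}

Track that hash (optionally combined with a patient ID from PID-3) instead of the raw text.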

You can store information in the globalChannelMap to compare values across messages. The map exists in memory only and won't survive a Mirth restart, but that shouldn't be a problem for your one-time conversion. If you need something more persistent, store the values in a database.
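
Outside Mirth, the PowerShell analogue of that persistence is just saving the seen-keys table between runs; file-based here for simplicity (the path is arbitrary), with a database as the sturdier option:

$state = "c:\Samples\seen-keys.clixml"
$seen  = if (Test-Path $state) { Import-Clixml $state } else { @{} }

# ... dedup work that fills $seen ...

$seen | Export-Clixml $state    # save for the next run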

agermano
  • Thanks for the comment, agermano. I was miles off course. I went a totally different route, not even processing it as HL7. I actually wound up using a combination of MS SSMS to remove the duplicates and PowerShell to reformat the segments. Much easier in the end. – Gryyphyn Oct 31 '19 at 16:21