Regex for ^ | in C#

Question

I am working on HL7 messages and I need a regex. This doesn't work:

HL7 message=MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1

My regex is:

MSH|^~\&|DATACAPTOR|\d{3}|\d{3}|(\d{4}\d{2}\d{2}\d{2}\d{2}\d{2})|ORU\\^R01|\d{20}|P|2.3|8859/1

Can anybody suggest a regex for special characters? I am using this code:

strRegex = "\\vMSH|^~\\&|DATACAPTOR|\\d{3}|\\d{3}|(\\d{4}\\d{2}\\d{2}\\d{2}\\d{2}\\d{2})|ORU\\^R01|\\d{20}|P|2.3|8859/1";
Regex rx = new Regex(strRegex, RegexOptions.Compiled | RegexOptions.IgnoreCase );

What is it you are trying to match? It is not very clear. Perhaps there is an easier way. — , Aug 21 '14 at 23:16
**(0)** Welcome to Stack Overflow. **(1)** Please don't even start parsing this protocol with regular expressions (unless it is just a toy learning project). If your software is supposed to be a serious part of the healthcare industry, **please be a serious software vendor** and use a serious parser, e.g. http://nhapi.sourceforge.net/home.php — xmojmr, Aug 22 '14 at 05:14
@xmojmr I think your critique is too harsh. Everybody should be allowed to use the tools they like, as long as they adhere to the spirit and the norms of HL7. — sqlab, Aug 22 '14 at 10:14
@sqlab A C# regex parser, especially the one designed by Vijaykumar, does not "adhere to the spirit and the norms of HL7". I'm the programmer responsible for the HL7 message reader/writer in our software. I spent several weeks refactoring and coding a logging system in order to troubleshoot a problem with a "Siemens MPPS Brooker" we had in a city a few hundred kilometers away. We were unable to diagnose and troubleshoot it remotely, and got no support from the software vendor (sort of our competition). We were guessing what was wrong, then built the troubleshooter and found that.. — xmojmr, Aug 22 '14 at 12:38
@sqlab ..their software used '\u000A' instead of '\u000D' as the segment terminator. Some programmer at Siemens thought they are just newlines that look similar, and that HL7 messages can be processed by Linux text processors because they look like text files. It cost us a few dozen man-days to pay for someone else's bug. They did not fix it. They did not even admit it. HL7 software that cannot preserve non-ASCII character sets, does not follow MSH-18, and does not support the extended character set escape codes is the norm in the systems we were interfacing with, not the exception... — xmojmr, Aug 22 '14 at 12:39
@sqlab ..Chinese software vendors of "cheap" laboratory machines are another nightmare we have met. Healthcare software manipulates **data about living people and their health conditions and can influence their health and even cause death**. It is not a fun web tweeting app that any programmer with 14 days of experience can write, or create by copy/pasting code Googled from the internet or various support forums. Vijaykumar, **please be a serious software vendor** — xmojmr, Aug 22 '14 at 12:39
@xmojmr From this question, it's not clear if this is really a '*software vendor*'-type question. I've had to create little utilities from time to time to manipulate very specific, controlled HL7 messages. One particular one was a very simple program to extract an embedded PDF. Since I controlled the format of the HL7, too, I knew exactly what kind of variance I needed to account for and was able to code it up in a few lines of code with a solution simpler than the one described in my answer. Your comments are all valid and correct, but I'm not sure they apply in this particular case. — p.s.w.g, Aug 22 '14 at 13:31
@p.s.w.g Although it is not certain that my comments apply in this case, IMO it is quite likely, and if so it should be avoided. I tried to write more in a standalone answer — xmojmr, Aug 22 '14 at 19:46
@sqlab There IS something in the question showing that something is wrong: the HL7 snippet, which is supposed to be an MSH segment fragment, is already completely messed up. You can see more details in my answer — xmojmr, Aug 22 '14 at 19:48

p.s.w.g · Answer 1 · 2014-08-21T22:13:54.053

|, ^, and \ are all special characters in regular expressions, so you'd have to escape them with \. Remember \ is also an escape character within a regular string literal so you'd have to escape that, too:

var strRegex = "\\vMSH\\|\\^~\\\\&\\|DATACAPTOR\\|…

But it's generally a lot easier to use a verbatim string literal (@"…"):

var strRegex = @"\vMSH\|\^~\\&\|DATACAPTOR\|…

Finally, note that (\d{4}\d{2}\d{2}\d{2}\d{2}\d{2}) can be simplified to (\d{14}).
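Putting the escaping rules above together, here is a minimal sketch of a complete pattern for the sample segment from the question. It rests on a few assumptions that are not in the question's own regex: the leading \v is dropped because the sample text contains no vertical tab, the message control ID is matched with \d{15} because the sample value has 15 digits (the question's \d{20} would never match it), and the dot in 2.3 is escaped so it only matches a literal dot. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MshRegexDemo
{
    // Verbatim string: every backslash reaches the regex engine unchanged.
    // \| \^ \\ and \. match the literal characters | ^ \ and .
    public const string Pattern =
        @"^MSH\|\^~\\&\|DATACAPTOR\|\d{3}\|\d{3}\|(\d{14})\|ORU\^R01\|\d{15}\|P\|2\.3\|8859/1$";

    // Returns true and the captured 14-digit timestamp when the segment matches.
    public static bool TryExtractTimestamp(string segment, out string timestamp)
    {
        Match m = Regex.Match(segment, Pattern);
        timestamp = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string segment =
            @"MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1";

        string timestamp;
        if (TryExtractTimestamp(segment, out timestamp))
        {
            Console.WriteLine(timestamp); // 20100816171948
        }
    }
}
```

A pattern this literal only accepts segments that look almost exactly like the sample, which is one reason the simpler approach may be preferable.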

However, for a structure like this, it's probably easier to just use the Split method.

var segment = @"MSH|^~\&|DATACAPTOR…";
var fields = segment.Split('|');
var timestamp = fields[5];

Warning: HL7 messages may use different control characters. They are declared at the start of the MSH segment, beginning with the 4th character, which is the field separator (in this case |^~\& are the control characters). If you don't control your input and these control characters may change, it's best to parse them first.
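As a rough sketch of what "parse the control characters first" could look like, the snippet below reads the field separator from the 4th character of the segment instead of hard-coding '|', then takes the component separator from the first character of the next field. The class and method names are hypothetical, and the sketch ignores escape sequences, repetitions, and subcomponents, so it is an illustration, not a parser.

```csharp
using System;

public static class MshDelimiterDemo
{
    // Reads the field separator from the segment itself (4th character)
    // instead of assuming it is always '|'.
    public static string[] SplitMsh(string segment)
    {
        if (segment == null || segment.Length < 4 ||
            !segment.StartsWith("MSH", StringComparison.Ordinal))
        {
            throw new FormatException("Not an MSH segment.");
        }

        char fieldSeparator = segment[3];
        return segment.Split(fieldSeparator);
    }

    public static void Main()
    {
        string segment = @"MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01";
        string[] fields = SplitMsh(segment);

        // fields[1] holds the remaining control characters, here ^~\&;
        // its first character is the component separator.
        char componentSeparator = fields[1][0];
        string[] messageType = fields[6].Split(componentSeparator);

        Console.WriteLine(fields[5]);      // 20100816171948
        Console.WriteLine(messageType[1]); // R01
    }
}
```

The same call works unchanged if a sender declares a different separator, for example a segment starting with MSH#^~\&#.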

Can you make the "Warning.." much much much bigger? HL7 v2 is not a funny text protocol that you can manipulate using 'awk' and replace segment terminators with arbitrary new line characters. It is a strictly defined binary protocol. It turns binary especially once the character sets come into play.. — xmojmr, Aug 22 '14 at 05:11

score 4 · Answer 2 · edited Aug 22 '14 at 20:03

To me, your question describes two distinct problems.

Problem 1) "..I need a regex..this doesn't work..My regex is..anybody suggest a (better) regex..?"

This is the good part of your question.

As @p-s-w-g already pointed out, some special characters in regular expressions must be escaped. The Microsoft Developer Network page "Character Escapes in Regular Expressions" tells you which characters are special and how to escape them.

To easily test whether your regex recognizes the grammar, you may find interactive regex testing tools useful, e.g. Regex Hero or The Regulator.

Problem 2) "I am working on HL7 messages..this doesn't work..My regex is..anybody suggest a (better) regex..?"

This is the bad part of your question.

The

MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1

example shown in your question is already not a valid HL7 message fragment. It is something similar to HL7, but it was probably already damaged by some text pre-processing code. HL7 v2 messages are not transmitted using a text protocol that can be manipulated with text tools. The protocol is binary, yet partially human-readable, so people can inspect it without any special tools. But it is a binary protocol and must be processed as such. Regex is a tool for working with text strings, not binary strings. And although it may seem possible to outsmart an ancient 20-year-old protocol with a new-age regex one-liner, it is not a good approach. I have tried to explain why not in the comments on your question.

Basic decoding of the fragment is:

MSH-0: MSH
MSH-1: |
MSH-2: ^~\&
MSH-3: DATACAPTOR
MSH-4: 123
MSH-5: 123
MSH-6: ! missing !
MSH-7: 20100816171948
MSH-8: ! missing !
MSH-9: ORU^R01
MSH-10: 081617194802900
MSH-11: P
MSH-12: 2.3
MSH-13: ! missing !
MSH-14: ! missing !
MSH-15: ! missing !
MSH-16: ! missing !
MSH-17: ! missing !
MSH-18: 8859/1

The ! missing ! pieces are really missing. In a normal MSH segment they would be present at their corresponding positions, just with a default empty value.
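To see why those positions matter, here is a small sketch that numbers the fields of the sample segment purely by position. It assumes the usual convention that MSH-1 is the field separator itself, so the text after the first separator is MSH-2. The class and method names are invented for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class MshNumberingDemo
{
    // Maps a raw MSH segment to HL7 field numbers by position.
    // MSH-1 is the field separator itself, so the text after the first
    // separator is MSH-2, the next is MSH-3, and so on.
    public static SortedDictionary<int, string> NumberFields(string segment)
    {
        char separator = segment[3];
        string[] parts = segment.Split(separator);

        var fields = new SortedDictionary<int, string>();
        fields[1] = separator.ToString();
        for (int i = 1; i < parts.Length; i++)
        {
            fields[i + 1] = parts[i];
        }
        return fields;
    }

    public static void Main()
    {
        string segment =
            @"MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1";

        foreach (KeyValuePair<int, string> pair in NumberFields(segment))
        {
            Console.WriteLine("MSH-" + pair.Key + ": " + pair.Value);
        }
    }
}
```

Read positionally, the timestamp lands in MSH-6 and 8859/1 in MSH-11, whereas the decoding above places them at MSH-7 and MSH-18. A positional reader would therefore misinterpret this fragment, because the empty fields that should hold the gaps open are absent.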

The message was created 4 years ago, in 2010, probably by Capsule Tech, Inc.'s DataCaptor™, and formatted according to rules defined by Health Level Seven, Version 2.3 © 1997, a standard that is 17 years old and has been updated several times. It was supposed to be used in one of the countries listed in Wikipedia: ISO/IEC 8859-1.

From your question I can't see more. But whatever you are trying to do, and whatever data you are going to process for whatever reason, the code fragment you are starting with is already wrong, and parsing HL7 with regex is a strange approach in general. If you're working on serious software to be used anywhere in the healthcare industry, please consider writing or using a serious and tested parser, e.g. the one used by the NHapi library: http://sourceforge.net/p/nhapi/code/HEAD/tree/NHapi20/NHapi.Base/Parser/PipeParser.cs

+1 for the clear discussion of what's wrong with this message and the suggested solutions. This answer would make an excellent addition to the proposed [Healthcare IT](http://area51.stackexchange.com/proposals/65896/healthcare-it?referrer=jqJsC2-RpEkd7A_oGCSFyQ2) site (if it makes it out of Area 51). — p.s.w.g, Aug 22 '14 at 19:55
