0

We receive a file daily with thousands of lines of data. Occasionally, a few lines will be messed up, causing an automated process to fail. When this happens, it can be difficult to find the errors.

I'd like to use a regular expression to find anything not conforming to the file's usual structure. All lines are supposed to look like the following:

ABC|SomeText|MoreText
DEF|SomeText|MoreText
ABC|SomeText|MoreText
GHI|SomeText|MoreText
DEF|SomeText|MoreText

So I need a regex that flags lines that don't begin with three letters and a pipe character. In the example below, it would flag line 3.

ABC|SomeText|MoreText
DEF|Some
Text|MoreText
ABC|SomeText|MoreText
GHI|SomeText|MoreText
DEF|SomeText|MoreText

Any help would be appreciated; I've been struggling with this for a while.

Many thanks

Jan
Cyan02

2 Answers

2

It would be very helpful to explain what dialect of regular expressions you are using. For example, if you are using grep, you can just use the -v option to invert the sense and then just write a normal regular expression, like so:

grep -v -E '^[A-Z]{3}\|[^|]*\|'

Otherwise, if you can't invert the sense but your tool supports negative lookahead (like Perl), you could do the following:

grep -P '^(?![A-Z]{3}\|[^|]*\|)'

The (?!...) part is the negative lookahead, so this matches any line whose start is not followed by three capital letters, a bar, some text, and then another bar.
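To check that the inverted match and the negative lookahead flag the same lines, here is a minimal Python sketch run against the sample from the question. The names `STRUCTURE`, `LOOKAHEAD`, and `sample` are illustrative only, and Python's `re` module stands in for PCRE here on the assumption that the two behave alike for a pattern this simple.

```python
import re

# Pattern from the answer: three capitals, a bar, some text, another bar.
STRUCTURE = re.compile(r'^[A-Z]{3}\|[^|]*\|')
# The same pattern wrapped in a negative lookahead, as in the grep -P example.
LOOKAHEAD = re.compile(r'^(?![A-Z]{3}\|[^|]*\|)')

# Sample lines taken from the question (illustrative data).
sample = [
    "ABC|SomeText|MoreText",
    "DEF|Some",
    "Text|MoreText",
    "ABC|SomeText|MoreText",
]

# Approach 1: keep lines where the structural pattern does NOT match (like grep -v).
inverted = [line for line in sample if not STRUCTURE.search(line)]
# Approach 2: keep lines where the lookahead pattern DOES match.
lookahead = [line for line in sample if LOOKAHEAD.search(line)]

print(inverted)
print(lookahead)
```

Both lists contain `DEF|Some` and `Text|MoreText`: because the pattern requires a second bar, the truncated line is flagged along with its continuation.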

Neil Roberts
  • Thanks for the quick response Neil. I'm not sure of the dialect. Currently I'm using the regex feature of notepad++ to search through the files. Not sure if that helps. I'll give these a try! – Cyan02 Aug 12 '16 at 15:54
  • Apparently notepad++ uses PCRE (Perl-compatible regular expressions) so it should support the second one with negative lookahead. Good luck! – Neil Roberts Aug 12 '16 at 15:57
  • You're quite correct, it was #2 for the win. Works like a charm! Sorry, this was my first question ... where do I toggle that as the correct answer? – Cyan02 Aug 12 '16 at 16:06
  • Apparently there should be a tick mark underneath the score for the answer that you can click on. Actually, I've never asked a question before so this is new to me too ☺ http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – Neil Roberts Aug 12 '16 at 16:09
1

For example, using awk:

awk '!/^[a-zA-Z]{3}\|/' input.txt

will display the 'flagged' lines.

awk '/^[a-zA-Z]{3}\|/' input.txt

will display the correct lines.
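If the line numbers matter more than the lines themselves, the same check can be sketched in Python. This is an illustration rather than part of the awk approach: `flag_lines` is a hypothetical helper, and the sample data is copied from the question.

```python
import re

# Same test as the awk pattern: three letters followed by a pipe at line start.
VALID_START = re.compile(r'^[a-zA-Z]{3}\|')

def flag_lines(lines):
    """Return (line_number, text) pairs for lines that fail the check."""
    return [
        (number, text.rstrip("\n"))
        for number, text in enumerate(lines, start=1)
        if not VALID_START.match(text)
    ]

# Illustrative data; any iterable of lines works the same way.
sample = [
    "ABC|SomeText|MoreText\n",
    "DEF|Some\n",
    "Text|MoreText\n",
    "ABC|SomeText|MoreText\n",
]

for number, text in flag_lines(sample):
    print(f"{number}: {text}")
```

With the sample above this reports only line 3, since `DEF|Some` still starts with three letters and a pipe. For a real file, an open file object can be passed in place of `sample`, because Python iterates over files line by line.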

wroniasty