
This is a working regex:

/(ANSI|AAMVA) (\d{6})(\d{2})(\d{2})(\d{0,2})((?:DL)|(?:ID))+(.*?)\g{-2}+([^"]+)/

This is a sample string:

"@\n\nANSI 6334290212DL00389199ZO04420478DLDAQ3572928\nDAASMITH, JOHN DOE\nDAG\nDAL4389 NE 47TH AVE\nDAIASHLAND\nDAJOR\nDAK97555      \nDARC   \nDASD         \nDATM     \nDAU504\nDAW180\nDBA12201212\nDBB19780303\n"

I am trying to match a delimiter, either DL or ID, that may be in the string a second time.

I want to match whichever of DL or ID matched previously.

The problem is that if I use ? on the backreference (\g{-2}?) to make the second delimiter optional, it seems to stop being greedy and prefers 0 matches.

I'm stumped, am I missing something basic with how ? operates?

Edit: The problem isn't extracting the JSON data, it's parsing the msg bit, using JSON doesn't do anything to accomplish this. I trimmed the string to just the pertinent part.

The fix by @hobbs works because it lets me change the ? to a + and still match nothing if nothing is there.

Works! :)

/(ANSI|AAMVA) (\d{6})(\d{2})(\d{2})(\d{0,2})((?:DL)|(?:ID))+(.*?)(?:\g{-2}|(?="))+([^"]+)/
Alan Moore
Judd
  • Welcome to Stack Overflow. Your formatting is hard to read. Can you edit your question to include a clean sample of the actual problem you are trying to solve? – Tim Biegeleisen Aug 26 '16 at 01:57
  • `ID` doesn't occur anywhere in your sample data. What are you really trying to do? – Borodin Aug 26 '16 at 02:25
  • Most of that string is irrelevant. Trim it down, show what output you get, and show what output you want. – ikegami Aug 26 '16 at 04:09
  • Your sample string looks an awful lot like JSON. Why not parse it as JSON? – Sobrique Aug 26 '16 at 07:58
  • The ID appears in different data; this is what the barcode on the back of a driver's license (DL) or ID contains. There are a lot of oddball states and a range of versions that add fun edge cases. – Judd Aug 26 '16 at 16:25

2 Answers


The problem isn't that \g{-2}? is non-greedy; it's that the (.*?) immediately before it is non-greedy, and \g{-2}? is capable of matching nothing, which means that it can't fail. And if it can't fail, then it doesn't force the group before it to match more than 0 characters. So invariably, (.*?) will match nothing, \g{-2}? will match nothing, and ([^"]+) will match everything.
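A minimal sketch of that behaviour, using a cut-down pattern and string rather than the full regex from the question (the simplification and the variable names are assumptions for illustration). In both patterns \g{-2} refers to the (DL|ID) group.

```perl
use strict;
use warnings;

# Cut-down sample: delimiter, payload, the same delimiter again, then data.
my $str = 'DL00389199ZO04420478DLDAQ3572928';

# Optional backreference: it can match nothing, so the lazy group never has to grow.
my @optional = $str =~ /(DL|ID)(.*?)\g{-2}?(.+)/;

# Mandatory backreference: the lazy group must expand until the delimiter is found.
my @required = $str =~ /(DL|ID)(.*?)\g{-2}(.+)/;

printf "optional: lazy group='%s' rest='%s'\n", @optional[1, 2];
printf "required: lazy group='%s' rest='%s'\n", @required[1, 2];
```

With the optional form the lazy group comes back empty and the last group holds everything after the first delimiter; with the mandatory form the lazy group captures the text between the two delimiters.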

I don't entirely understand the format you're trying to extract (other than that it's old and weird and reminds me of CIBER billing records), but I would suggest that you either need more anchoring to focus your regex's attention on the right place, or you need to upgrade to something like a proper parser for the format. Since you're saying that you added the ? to handle the case where the delimiter never appears, the quickest band-aid fix would possibly be (?:\g{-2}|(?=")) which asserts that you either find the delimiter, or you got to the closing quote without finding it.
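As a rough sketch of how that alternation behaves, here it is in a cut-down pattern run against two made-up strings, one with a repeated delimiter and one without (the strings and variable names are assumptions, not the asker's real data). One adjustment is deliberate: the trailing group is written [^"]* rather than [^"]+, because once the (?=") branch has succeeded the next character is the quote, so a group that requires at least one non-quote character cannot match at that point.

```perl
use strict;
use warnings;

# Delimiter, lazy payload, then either the same delimiter again
# or a lookahead confirming that the closing quote has been reached.
my $re = qr/(DL|ID)(.*?)(?:\g{-2}|(?="))([^"]*)/;

my @rows;
for my $str ('DL00389199ZO04420478DLDAQ3572928"', 'ID00389199ZO04420478"') {
    if (my ($delim, $middle, $rest) = $str =~ $re) {
        push @rows, [$delim, $middle, $rest];
        print "delim=$delim middle=$middle rest=$rest\n";
    }
}
```

In both cases the lazy group is forced to grow: in the first string it stops at the second DL, and in the second string it stops just before the closing quote, leaving the last group empty.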

Although, Borodin's observation is also valid; it would be much better to decode the JSON first and then work with the string from the decoded JSON structure, instead of trying to run a regex on the JSON directly. In that case, you should be looking for \z (end of string) rather than ".
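A small sketch of the decoded-string variant, assuming the message has already been pulled out of the JSON so that it contains real newlines and no closing quote. The sample message below is hypothetical and deliberately has no second delimiter. Because . does not match a newline by default, the lazy group can only reach \z if the /s modifier is present.

```perl
use strict;
use warnings;

# Hypothetical decoded message: real newlines, no second "ID", no closing quote.
my $msg = "ID00389199ZO04420478\nDAQ3572928\n";

# Without /s, "." cannot cross the newline, so the lazy group never reaches \z.
my $matched_without_s = $msg =~ /(DL|ID)(.*?)(?:\g{-2}|\z)/ ? 1 : 0;

# With /s, the lazy group can run to the end of the string when the delimiter is absent.
my ($delim, $middle) = $msg =~ /(DL|ID)(.*?)(?:\g{-2}|\z)/s;

print "matched without /s: $matched_without_s\n";
print "delimiter with /s: $delim\n";
print "payload length with /s: ", length($middle), "\n";
```

When the delimiter does repeat on the first line, the lazy group stops there and /s only affects whatever follows, so the modifier matters most in the fallback case shown here.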

hobbs
  • *"it's old and weird"* You'll have to defend that. I have no idea what your background is, but the OP's data is JSON. It's a very useful and common representation of simple data that is supported by JavaScript, where it originated, Python, Perl, PHP, Java, C and its family, and others. It has its own MIME type of `application/json` and I would guess that it is at least as popular as XML. Where have you been? – Borodin Aug 26 '16 at 02:59
  • @Borodin I'm obviously not talking about JSON. You didn't pay sufficient attention to the question or my answer. – hobbs Aug 26 '16 at 03:02
  • I read your submission carefully. While it may make perfect sense in your head, you never mention JSON and begin a new paragraph with *"I don't entirely understand the format you're trying to extract"*. To me that means you are clueless about the whole of the OP's data. You also have a semicolon ( these `;` ) after *"Although, Borodin's observation is also valid"* which indicates that you disagree with me. Drop the comma ( `,` ) and use a colon ( `:` ) and the sense of your post will agree with your comment. – Borodin Aug 26 '16 at 03:09
  • if the json is decoded first, `/s` would also be needed – ysth Aug 26 '16 at 09:54
  • Making the match be the \g{-2} OR an empty string was a good fix; it let me add the + sign without missing when the 2nd one isn't there :). I assumed that \g{-2}? would prefer to match 1 over 0, but it seems that's not the case. – Judd Aug 26 '16 at 16:31
  • @Judd it *does* prefer to match 1 over 0, but if it's attempted at a location where it can't match, it *will* match 0. – hobbs Aug 26 '16 at 17:58
  • @hobbs Ahh, I get it, very subtle. Thank you. – Judd Aug 26 '16 at 18:05

Your data is JSON, and it is very wrong to try to process it using regex patterns. There are perfectly good Perl modules to convert the text to a navigable data structure.

I can't understand exactly what you need because you're talking about DL and ID strings, and ID doesn't occur anywhere in your sample data. But this short program should help:

use strict;
use warnings 'all';
use feature 'say';

use JSON 'decode_json';

my $json = do {
    local $/;
    <DATA>;
};

my $data = decode_json $json;

say $data->{msg};


__DATA__
{"name":"SC","hostname":"tukwila","pid":11,"level":30,"msg":"@\n\nANSI 6334290212DL00389199ZO04420478DLDAQ3572928\nDAASMITH, JOHN DOE\nDAG\nDAL4389 NE 47TH AVE\nDAIASHLAND\nDAJOR\nDAK97555      \nDARC   \nDASD         \nDATM     \nDAU504\nDAW180\nDBA12201212\nDBB19780303\n","time":"2016-04-02T01:09:07.113Z","v":0}

output

@

ANSI 6334290212DL00389199ZO04420478DLDAQ3572928
DAASMITH, JOHN DOE
DAG
DAL4389 NE 47TH AVE
DAIASHLAND
DAJOR
DAK97555      
DARC   
DASD         
DATM     
DAU504
DAW180
DBA12201212
DBB19780303
Borodin
  • It's not the JSON that's the interesting part. Sure, this is the right first step, but my understanding is that the real task is extracting bits of data like e.g. `00389199ZO04420478`. – hobbs Aug 26 '16 at 02:35
  • @hobbs: My apologies. I missed that you weren't the OP – Borodin Aug 26 '16 at 03:12
  • @hobbs is correct, the JSON stuff is handled elsewhere, it's the ANSI stream that's the relevant part for my question. – Judd Aug 26 '16 at 16:07