2

I am currently working with a .tsv database whose headers are full of information that is meaningful to me.

I therefore wanted to extract that information from the headers into a form that I and other users (who are not proficient with relational databases, so we mostly end up using Excel to organize and process data) could handle more easily in Excel, by splitting the fields with tabs.

Example header:

>(name1)database-ID:database2-ID:value1:value2

(I know it seems strange to put values in a header, but they describe parameters of a third value associated with the header, which we don't need to deal with here.) The output should be:

name1\tdatabase-ID\tdatabase2-ID\tvalue1\tvalue2\n

I therefore pasted my data (headers, one per line) into EmEditor (Boost syntax) and came up with this regex:

 >\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n

with the capturing groups then separated from one another by tabs in the replacement. It works, with perfect matches, no problem.
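For readers who want to reproduce the replacement outside EmEditor, here is a minimal sketch in Python. It assumes Python's `re` module behaves like the Boost syntax for this particular pattern; the names `HEADER`, `header_to_tsv` and `sample` are invented for illustration.

```python
import re

# The pattern from the question, unchanged.
HEADER = re.compile(r">\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n")

def header_to_tsv(text):
    """Rewrite each well-formed header as five tab-separated fields."""
    # \1..\5 are the five capturing groups; \t and \n are expanded by re.sub.
    return HEADER.sub(r"\1\t\2\t\3\t\4\t\5\n", text)

sample = ">(name1)database-ID:database2-ID:value1:value2\n"
print(repr(header_to_tsv(sample)))
```

Lines that the pattern cannot match are left untouched by the substitution, which is why malformed headers go unnoticed.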

But I became aware that there were malformed lines that didn't respect the logic of the whole database, and I wanted an expression to pick them all out at once.

An example including a wrong line would be:

>(name1)database-ID:database2-ID:value1-1:value1-2\n
>(name2)database-ID:database2-ID:value2-1:value2-2\n
>(name3)database-ID:database2-ID:value3-1value3-2\n

The last line is ill-formed because it lacks the : between the last two values. I want it to be matched by building on the original expression that recognizes well-formed lines.

I know perfectly well that I could come up with different solutions by slightly tweaking my first expression to eliminate the good lines and retrieve the malformed ones afterwards, but I don't want a workaround for my process; I just want to understand what I did wrong, so that I become more educated (and not just trickier at circumventing mistakes that I can't resolve):

I tried a negation of the above mentioned expression:

([^(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)])

That doesn't match anything.

I tried a negative lookahead, but it is extremely, painfully slow, and then it matches every possible zero-length position in the document:

(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))

I therefore added a capturing group for a string of characters after it, but that doesn't work either:

(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))(^.*?)

So please explain to me where I went wrong with the negating group ([^whatever]) and with the negative lookahead.

Ando Jurai

3 Answers

3

You could do this simply with the PCRE verbs (*SKIP)(*F). The regex below would match all the bad lines.

(?:^>\([^()]*\):[^:]*:[^:]*:[^:]*:[^:\n]*$)(*SKIP)(*F)|^.+
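Python's built-in `re` module does not support backtracking verbs such as (*SKIP)(*F), so the regex above cannot be used there as is. A rough sketch of the same idea with a different technique: let the first alternative consume well-formed lines and capture everything else in a group. It assumes the header format shown in the question (no colon right after the closing parenthesis); `GOOD_OR_BAD` and `bad_lines` are names made up for illustration.

```python
import re

# First alternative: a well-formed line (exactly three colons after the name).
# Second alternative: any other non-empty line, captured in group 1.
GOOD_OR_BAD = re.compile(
    r"^>\([^()\n]*\)[^:\n]*:[^:\n]*:[^:\n]*:[^:\n]*$|^(.+)$",
    re.MULTILINE,
)

def bad_lines(text):
    """Return the lines that the well-formed alternative did not consume."""
    return [m.group(1) for m in GOOD_OR_BAD.finditer(text)
            if m.group(1) is not None]

text = (
    ">(name1)database-ID:database2-ID:value1-1:value1-2\n"
    ">(name2)database-ID:database2-ID:value2-1:value2-2\n"
    ">(name3)database-ID:database2-ID:value3-1value3-2\n"
)
print(bad_lines(text))
```

Group 1 only takes part in the match when the first alternative failed, which plays the role that (*SKIP)(*F) plays in PCRE.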

Avinash Raj
  • Thanks for your answer. Actually, I am not wrong about the ^> part: when matching well-formed lines I want lines that start with a >, hence the ^> (the > doesn't appear in the header example; I suppose something is eating the character, even in code blocks). Taking that into account, should my modification of (?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))(^.*?) be ^(?!(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))(^.*?) based on what you did? I still don't understand what doesn't work in my initial use of grouping. – Ando Jurai Nov 06 '14 at 16:12
  • @AndoJurai could you provide an example with expected output? – Avinash Raj Nov 06 '14 at 16:17
  • you could also use `(?:^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)(*SKIP)(*F)|^.+` – Avinash Raj Nov 06 '14 at 16:20
  • Well, the example for correct lines was my example string before you edited it: >(name1):database-ID:database2-ID:value1:value2\n and the output would be name1\tdatabase-ID\tdatabase2-ID\tvalue1\tvalue2\n With wrong lines it would be: >(name1):database-ID:database2-ID:value1-1:value1-2\n >(name2):database-ID:database2-ID:value2-1:value2-2\n >(name3):database-ID:database2-ID:value3-1value3-2\n The last line is ill-formed because it lacks the ":" between the last two values. I want it to be matched by using the negative lookahead, based on the expression that recognizes well-formed lines. – Ando Jurai Nov 06 '14 at 16:23
  • add the above comment to your question because it's hard to read. – Avinash Raj Nov 06 '14 at 16:25
  • Thanks for the hint. I don't even know what SKIP and F are there, or why it doesn't match the good lines (it seems to me that the part in the non-capturing group recognizes good lines, so how can the alternative work then?). But I still want to understand what mistake I made with the lookahead. Why doesn't it work? Where does my logic fail? – Ando Jurai Nov 06 '14 at 16:27
  • it's hard to read your input, it needs formatting so that i could understand and point out the error. – Avinash Raj Nov 06 '14 at 16:29
  • Yep, the non-capturing group matches the good lines and makes the engine skip them. Then the alternative `^.+` matches the bad lines. – Avinash Raj Nov 06 '14 at 16:30
  • Thanks for the help. I am sorry, I tried to improve the formatting by adding the example, but the page messes with my line feeds -_- – Ando Jurai Nov 06 '14 at 16:36
  • OMG. It's hard to understand your input. Could you post the input in pastebin? – Avinash Raj Nov 06 '14 at 16:37
  • Well, thanks Jongware for editing it well. I don't understand why four-space indentation didn't work for me at first, but now it should be readable. – Ando Jurai Nov 06 '14 at 16:55
  • Thanks for your help, but as I have stated, I want to understand my original mistake with the negative lookahead more than I want an alternative solution (even though I am glad you took the time and taught me new ways, which I would upvote if I could). Also, your expression doesn't behave like my original matching expression, and it makes a mistake: it doesn't recognize >(name3):database-ID::database2-ID:value3-1value3-2 as wrong although it is (there are two consecutive ":"), which is why I first want the opposite of my regular match. So where is my logic/syntax failing, please? – Ando Jurai Nov 07 '14 at 09:08
  • So you want at least one character present between the `:`. If so, you could use this regex: http://regex101.com/r/sP9tL6/2 Your negated character class is wrong. For example, `[^(.*)]` matches any character except `(`, `.`, `*` or `)`. – Avinash Raj Nov 07 '14 at 09:21
  • @Unihedron did you see the revisions on my answer? At first, I recommended the same regex you mentioned. It's hard to understand what the OP needs. – Avinash Raj Nov 07 '14 at 09:31
  • Eh? It's not, I'm suggesting this: `^>\([^()]*\)(?::[^:]*){3}:[^:\n]*$(*SKIP)(*F)|^.+` – Unihedron Nov 07 '14 at 09:33
  • @Unihedron I'm ready to edit it like the above, but I think the OP wants at least one character present between the `:` symbols. – Avinash Raj Nov 07 '14 at 09:35
  • Thanks for your help and the explanation of the negated character class. I don't want something that works differently from the original regex that matches good lines: >\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n . The different tools in the last proposed expressions are not what I am looking for. I repeat: I don't want solutions that use different techniques, I just want to know what I did wrong. The purpose is educational, not practical. Thanks for . I would also like to understand why using a negative lookahead doesn't work. Either I match every zero-length position, or I don't match malformed lines when I should. – Ando Jurai Nov 07 '14 at 09:54
  • see what's wrong with your regex here http://stackoverflow.com/a/26797767/3297613 – Avinash Raj Nov 07 '14 at 09:57
  • Actually the logic I want to use is, "If you don't see this string (with the correct matching regex), by using the negative lookahead, then match the thing". I know that lookarounds don't capture so I used a generic matching behind the lookahead (.*?), but that won't work. My question is WHY? – Ando Jurai Nov 07 '14 at 10:00
3

So please explain to me where I went wrong with the negating group ([^whatever]) and with the negative lookahead.

Let's address the question first: What does [^(pattern)] do?

You seem to have a misunderstanding and expect it to:

  • Match everything except the subpattern pattern. (Negation)

What it actually does is:

  • Match any single character that isn't one of (, p, a, t, ... n, ).

Therefore, the pattern

([^(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)])

... matches a single character that isn't any of (, >, (, ... \n, ).
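A quick way to see this, sketched in Python (an assumption: character classes behave the same way in `re` as in the Boost syntax here; `negated_class` is just an illustrative name):

```python
import re

# Inside [...] the parentheses lose their grouping meaning: the class simply
# excludes the individual characters ( p a t e r n ).
negated_class = re.compile(r"[^(pattern)]")

# Every character of "pattern" is excluded, so nothing is found.
print(negated_class.findall("pattern"))

# None of x, y, z is in the excluded set, so each one matches on its own.
print(negated_class.findall("xyz"))
```

Each match is always exactly one character long; the class never looks at the string "pattern" as a unit.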


As for the negative lookahead, you're simply doing it wrong. The anchor ^ is in the wrong position, therefore your assertion will fail to provide any useful help. It's also not what negative lookaheads are for altogether.

(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))

I'll explain what this does:

  • (?! Opens the negative lookahead group: asserts that this pattern does not match at the current position, without moving the pointer.
  •     ( Capturing group. Capturing groups inside a negative lookahead are useless: the lookahead only succeeds when its subpattern fails to match, so the groups never hold anything afterwards.
  •         ^ Asserts position at the start of the string (or of a line, in multiline mode).
  •         >\( Literal character sequence ">(".
  •         (.*) Capturing group which matches as many characters as possible except newlines, then backtracks as needed.
  •         \) Literal character ")".
  •         (.*?) Capturing group with a reluctant zero-or-more match of any characters except newlines.
  •         \: Literal character ":".
  •         (.*?)\:(.*?)\:(.*?) The same reluctant group and literal ":" repeated, ending with a final reluctant group.
  •         \n A new line.
  •     ) Closes the capturing group.
  • ) Closes the negative lookahead group. When this assertion is finished, the pointer position is the same as at the beginning, and thus the resulting match is zero-length.

Note that the anchor is nested within the negative lookahead group. It should be at the start:

^(?!(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))

While this doesn't return anything useful, it explains what is wrong, since you don't need a solution. ;)
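To see the anchored lookahead return something, a consuming part can be appended after it. The sketch below does this in Python, assuming `re.MULTILINE` stands in for EmEditor's line-by-line behaviour; the trailing `.+` and the names `NOT_WELL_FORMED` and `text` are additions for illustration, not part of the expression above.

```python
import re

# The anchored lookahead from above, followed by .+ so that the match
# actually consumes the line instead of being zero-length.
NOT_WELL_FORMED = re.compile(
    r"^(?!(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)).+",
    re.MULTILINE,
)

text = (
    ">(name1)database-ID:database2-ID:value1-1:value1-2\n"
    ">(name2)database-ID:database2-ID:value2-1:value2-2\n"
    ">(name3)database-ID:database2-ID:value3-1value3-2\n"
)

flagged = [m.group(0) for m in NOT_WELL_FORMED.finditer(text)]
print(flagged)

# Caveat: (.*?) may match an empty field, so a line with "::" and three
# colons in total still counts as well-formed for this pattern.
tricky = ">(name3)database-ID::database2-ID:value3-1value3-2\n"
print(NOT_WELL_FORMED.search(tricky))
```

The caveat in the last lines is a property of the original well-formed pattern itself, not of the lookahead: the lookahead can only be as strict as the pattern it wraps.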

In case you are in need of a solution suddenly, please refer to this relevant answer of mine (I'm not adding anything else into the post):

Unihedron
  • Thanks! You got it :) Actually, I added the capturing group so that I could tell (^.*?) apart once I added it to have something to match (and I kept the capturing group in my tests so it wouldn't disturb the original behavior, to be able to compare). You said that negative lookaheads are not for that, but as I understand it, they are for matching something like (EX(?!AMPLE).*), which can match EXE or EXES but not EXAMPLE. So why can't ((?!pattern).*) match everything but the pattern? – Ando Jurai Nov 07 '14 at 10:11
  • It works, but it only negates at the pointer position where the regex is currently being tried. [Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.](http://stackoverflow.com/a/25511635/3622940) Negative lookaheads assert that a subpattern cannot be matched at that position; they do not negate a subpattern. – Unihedron Nov 07 '14 at 10:13
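The point about the pointer position can be checked with a small Python sketch (assuming Python's `re` follows the same lookahead semantics; the variable names are illustrative only):

```python
import re

# A bare negative lookahead consumes nothing: it succeeds, with an empty
# match, at every position where "AMPLE" does not start.
positions = [m.start() for m in re.finditer(r"(?!AMPLE)", "EXAMPLE")]
print(positions)

# After a literal prefix, the lookahead is tested right after "EX" only.
print(re.match(r"EX(?!AMPLE).*", "EXES"))      # matches
print(re.match(r"EX(?!AMPLE).*", "EXAMPLE"))   # None

# Without the prefix, the test happens at position 0, where the text is
# "EXAMP...", not "AMPLE", so .* happily swallows the whole word.
whole = re.search(r"(?!AMPLE).*", "EXAMPLE").group(0)
print(whole)
```

This is also why an unanchored lookahead over a whole document is slow and returns many zero-length matches: it is retried at every single position.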
1

Based on what I have read from Unihedron,
this is what I came up with in EmEditor:

^(?!>\(([A-Za-z0-9_\'\-]*?)\)(([A-Za-z0-9_\'\-]*?)\:){3}([A-Za-z0-9_\'\-]*?)\n).*\n
>(name1)database-ID:database2-ID:value1-1:value1-2
(NOT MATCH)

>(name2)database-ID:database2-ID:value2-1:value2-2
(NOT MATCH)

>(name3)database-ID:database2-ID:value3-1value3-2
(MATCH)

>(name3)database-ID::database2-ID:value3-1:value3-2
(MATCH)

(The character class avoids discarding names that include special characters, without making it possible to have two consecutive ":".)

I could also achieve the same result with:

(?!^>\(([A-Za-z0-9_\'\-]*?)\)(([A-Za-z0-9_\'\-]*?)\:){3}([A-Za-z0-9_\'\-]*?)\n)^.*\n
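The four sample lines above can be replayed with a short Python sketch. It assumes `re.MULTILINE` reproduces EmEditor's handling of ^ for this pattern; `MALFORMED`, `text` and `matches` are illustrative names.

```python
import re

# The first expression from this answer, unchanged.
MALFORMED = re.compile(
    r"^(?!>\(([A-Za-z0-9_\'\-]*?)\)(([A-Za-z0-9_\'\-]*?)\:){3}"
    r"([A-Za-z0-9_\'\-]*?)\n).*\n",
    re.MULTILINE,
)

text = (
    ">(name1)database-ID:database2-ID:value1-1:value1-2\n"
    ">(name2)database-ID:database2-ID:value2-1:value2-2\n"
    ">(name3)database-ID:database2-ID:value3-1value3-2\n"
    ">(name3)database-ID::database2-ID:value3-1:value3-2\n"
)

# Each match is one whole malformed line, including its newline.
matches = [m.group(0) for m in MALFORMED.finditer(text)]
for line in matches:
    print(line, end="")
```

Because the character class excludes ":", the lookahead can no longer stretch a field across a colon, which is what lets the fourth line be flagged here.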

So I guess that all along capturing groups were what was messing with my lookahead.

Now I acknowledge that Avinash Raj's solution with the (*SKIP)(*F)|^.+ pattern is more efficient; it's just that I didn't know about those verbs, and I also wanted to understand my logic / syntax mistake. (Thanks to Unihedron for that.)

Ando Jurai