I am currently working with a .tsv database whose headers carry a lot of meaningful information.
I wanted to split each header into tab-separated fields, so that I and other users (who are not proficient with relational databases, so we mostly end up using Excel to organize and process data) can handle them more easily in Excel.
Example header:
>(name1)database-ID:database2-ID:value1:value2
(I know it seems strange to put values in a header, but they describe parameters of the third value associated with the header, which we don't need to deal with here.) Desired output:
name1\tdatabase-ID\tdatabase2-ID\tvalue1\tvalue2\n
I pasted my data (headers, one per line) into EmEditor (Boost regex syntax) and came up with this regex:
>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n
with the capturing groups then written back separated by tabs. It works: every well-formed line matches, no problem.
But I became aware that there were malformed lines that didn't respect the logic of the rest of the database, and I wanted to write an expression that picks them all out at once.
An example that includes a malformed line:
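For anyone who wants to reproduce the substitution outside EmEditor, here is a minimal sketch in Python. It assumes that Python's `re` module behaves like the Boost engine for this particular pattern; the replacement string and the names `PATTERN`, `REPLACEMENT`, `sample` and `converted` are only illustrative, since I did the replacement through the editor's dialog.

```python
import re

# The expression from above, unchanged.
PATTERN = re.compile(r">\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n")

# Assumed replacement: the five captured groups joined by tabs.
REPLACEMENT = r"\1\t\2\t\3\t\4\t\5\n"

sample = ">(name1)database-ID:database2-ID:value1:value2\n"

# Rewrite the header as one tab-separated row.
converted = PATTERN.sub(REPLACEMENT, sample)
print(repr(converted))
```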
>(name1)database-ID:database2-ID:value1-1:value1-2\n
>(name2)database-ID:database2-ID:value2-1:value2-2\n
>(name3)database-ID:database2-ID:value3-1value3-2\n
The last line is malformed because it lacks the : between the last two values.
I want to match it by building on the original expression that recognizes well-formed lines.
I know perfectly well that I could come up with other solutions, for example by slightly tweaking my first expression to eliminate the good lines and then retrieving the malformed ones. But I am not looking for a workaround for my process: I want to understand what I did wrong here, so that I become more educated (and not just better at circumventing mistakes I can't resolve).
I tried negating the above expression:
([^(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)])
That doesn't match anything.
I tried a negative lookahead, but it is extremely, painfully slow, and then it produces every possible zero-length match in the document:
(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))
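The zero-length behaviour can be reproduced on the three sample lines with a small Python sketch. The assumptions here are that Python's `re` stands in for the Boost engine and that `re.MULTILINE` mimics the way `^` matches at each line start in the editor; `text`, `lookahead`, `matches` and `starts` are illustrative names.

```python
import re

text = (
    ">(name1)database-ID:database2-ID:value1-1:value1-2\n"
    ">(name2)database-ID:database2-ID:value2-1:value2-2\n"
    ">(name3)database-ID:database2-ID:value3-1value3-2\n"
)

# The negative lookahead from above, on its own: it consumes no characters.
lookahead = re.compile(
    r"(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))", re.MULTILINE
)

matches = list(lookahead.finditer(text))
starts = [m.start() for m in matches]

# Every match is empty, and one is reported at almost every position:
# only the starts of the well-formed lines are skipped.
print(len(text), len(matches))
print(all(m.group(0) == "" for m in matches))
```

In this sketch the only positions without a match are the starts of the two well-formed lines, which fits what I see in the editor.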
I then added a capturing group after the lookahead to grab a string of characters, but that doesn't work either:
(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))(^.*?)
So please explain where I went wrong with the negated group ([^whatever]) and with the negative lookahead.