Merge lines which don't match a regex

Question

I have a file which contains logs from the web; a simplified version of it is as follows:

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
Aix
SCO

I have tried a couple of regex combinations with awk/sed to identify the Accept-Language value, which starts every log record:

/^[a-z]{2}(-[A-Z]{2})?/
/\*|[A-Z]{1,8}(-[A-Z0-9]{1,8})*/i  
/([^-;]*)(?:-([^;]*))?(?:;q=([0-9]\.[0-9]))?/

So far I haven't managed to get either awk or sed to give me the following result:

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    Unix    Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    START    Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    Aix    SCO

Any help is appreciated. The file contains over a million records, so I'm happy to go down a route that doesn't use sed/awk if it improves performance.

In your desired output, I believe you have an extra line. Remove one of the first three? — Rob Davis, Dec 23 '16 at 21:20

Lars Fischer · Answer 1 · 2016-12-23T17:54:10.280

3

Based on the observation that the two types of lines can be told apart by the presence of an =, you can use this awk script:

file.awk

$0 ~ /=/ { printf("%s%s", v,$0)
           v="\n"
           next
         } 
         { printf("\t%s", $0) } 
END      { printf("\n") }

You use it like this: awk -f file.awk yourfile

v is empty for the first line; after that it holds the line break.
For lines containing an =, we print $0 preceded by v.
For all other lines (note the next in the first action), we print $0 without a trailing newline, preceded by a \t as separator.

edited Dec 23 '16 at 17:54

answered Dec 23 '16 at 17:48

Don't hard-code the newline string with `printf("\n")`, use `print ""` instead so awk can use whatever newline string is appropriate for the environment it's being called from, e.g. `\r\n`. Also, you don't need the `$0 ~` as that's the default. – Ed Morton Dec 24 '16 at 16:15
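
For illustration, here is a minimal sketch of the script above with both suggestions from Ed Morton's comment folded in: the pattern is written without `$0 ~`, and the final line break comes from `print ""`. Using `ORS` instead of a literal "\n" for the separator, and wrapping everything in a shell function called merge_on_equals, are assumptions of this sketch rather than part of the original answer.

```shell
# Hypothetical wrapper name; the awk body follows file.awk from the answer.
merge_on_equals() {
  awk '
    /=/ { printf "%s%s", sep, $0   # record line: close the previous output line, print this one
          sep = ORS                # sep is empty only before the first record
          next
        }
        { printf "\t%s", $0 }      # continuation line: append it after a tab
    END { print "" }               # terminate the last output line with ORS
  ' "$@"
}
```

Called as merge_on_equals yourfile, it should behave like awk -f file.awk yourfile.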

Rob Davis · Answer 2 · 2016-12-31T17:20:56.500

Just for fun, here's a sed solution:

sed -ne 1bgo \
   -e '/^[a-z][a-z]-[A-Z][A-Z]/ { x;p;s/.*//;x; };:go' \
   -e 'H;x;s/^\n//;s/\n/  /;x;${ x;p; }' < input

It works like this:

Read each line but instead of printing it right away, save it by appending it to the hold space (H), except remove any newlines that separate it from whatever was already there (x;s/^\n//;s/\n/  /;x). (If you want tabs in your output, put them in the second substitution, where I've put a couple of spaces.)
If you come across a line that matches your Accept-Language pattern, flush the hold space before you append anything to it. Print it and clear it (x;p;s/.*//;x). Then proceed as usual with the appending and whatnot.
Treat the first and last lines differently from all others: never flush the hold space after reading just the first line (1bgo skips past that, down to the position labeled :go), and always flush the hold space after reading the last line (${ x;p; })
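
As a sketch of the tab variant mentioned in the first point: the function below (merge_with_sed is a made-up name) runs the same commands, split over more -e options, and substitutes a literal tab held in a shell variable. That detour rests on the assumption that \t in a sed replacement is not understood by every sed.

```shell
# Hypothetical wrapper around the sed command from the answer, with tabs as separator.
merge_with_sed() {
  tab=$(printf '\t')   # literal tab character for the replacement text
  sed -n \
    -e '1bgo' \
    -e '/^[a-z][a-z]-[A-Z][A-Z]/ { x;p;s/.*//;x; }' \
    -e ':go' \
    -e 'H;x;s/^\n//' \
    -e "s/\n/${tab}/" \
    -e 'x;${ x;p; }' "$@"
}
```

It reads from the named file or from standard input, e.g. merge_with_sed < input.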

This also works perfectly I should add! The only problem with it is that it takes a long time to process data on a larger scale (especially with Mapreduce). Thanks for your response — Amine Jaidi, Dec 28 '16 at 20:26

James Brown · Accepted Answer · 2016-12-25T11:07:30.787

0

$ awk '/[a-z]{2}-[A-Z]{2}/ { print b; b=$0; next }  # @xx-XX empty buffer, refill
                           { b=b OFS $0 }           # otherwise append to buffer
                       END { print b }' file        # dump the buffer in the end

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; Unix Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; START Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; Aix SCO

Note that the output starts with an empty line. If you want a tab delimiter in the output, set OFS: awk -v OFS="\t" ...

edited Dec 25 '16 at 11:07

answered Dec 25 '16 at 10:59

This script has not worked for me; it merges all the lines into one. – Amine Jaidi Dec 28 '16 at 10:25
@AmineJaidi That's odd. What's your environment and which awk do you use? – James Brown Dec 28 '16 at 11:11
I'm on Red Hat, not using gawk. The thing is, the file is already \t-delimited, so basically the problem I am trying to solve is to make sure that every line that does not start with an Accept-Language match is appended to the previous one. The sed solution below is working; it would be good to know how awk can do it, as I've had a stab with no luck. I have implemented the sed solution as part of a reduce function in Hadoop and it is quite slow. – Amine Jaidi Dec 28 '16 at 12:58
First thing that comes to mind is that your awk doesn't support `{2}` in regex. Replace regex: `/[a-z]{2}-[A-Z]{2}/` with `/[a-z][a-z]-[A-Z][A-Z]/`. – James Brown Dec 28 '16 at 13:21
That was spot on; the script works now, but it adds a \n at the beginning of the file :( – Amine Jaidi Dec 28 '16 at 13:41
I know, I mentioned that in my solution. Gimme a second to see if it's an easy one. – James Brown Dec 28 '16 at 13:57
Adding `if(b!="")` before the `print b; ...` in the first line should fix the problem (or `if(NR>1)`). – James Brown Dec 28 '16 at 14:00
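
Putting the comment thread together, a possible final version might look like the sketch below. It uses the interval-free regex and the `if(NR>1)` guard from the comments, plus the tab OFS from the answer. Anchoring the pattern with ^, guarding END against empty input, and the function name merge_records are assumptions of this sketch and were not part of the thread.

```shell
# Hypothetical wrapper name; combines the fixes discussed in the comments.
merge_records() {
  awk -v OFS='\t' '                   # awk turns \t in the -v value into a tab
    /^[a-z][a-z]-[A-Z][A-Z]/ {        # a line starting with xx-XX opens a new record
      if (NR > 1) print b             # flush the previous record, but not before line 1
      b = $0
      next
    }
    { b = b OFS $0 }                  # any other line is appended to the buffer
    END { if (NR > 0) print b }       # flush the last record unless the input was empty
  ' "$@"
}
```

Usage would be merge_records file; the output should no longer start with an empty line.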
