How to extract multiple params from string using sed or awk

Question

I have a log file which looks like this:

2010/01/12/ 12:00 some un related alapha 129495 and the interesting value 45pts
2010/01/12/ 15:00 some un related alapha 129495 and no interesting value
2010/01/13/ 09:00 some un related alapha 345678 and the interesting value 60pts

I'd like to plot the date-time string against the interesting value using gnuplot. To do that, I'm trying to parse the above log file into a CSV file that looks like this (not all lines in the log have a plottable value):

2010/01/12/ 12:00, 45
2010/01/13/ 09:00, 60

How can I do this with sed or awk?

I can extract the initial characters something like:

cat partial.log | sed -e 's/^\(.\{17\}\).*/\1/'

But how can I extract the end values?

I've been trying to do this to no avail!

Thanks

Oh, by the way, [don't use `cat` like that](https://web.archive.org/web/20130307065129/http://partmaps.org/era/unix/award.html) — carlpett, Sep 08 '11 at 20:28

score 1 · Accepted Answer · answered Jun 21 '22 at 13:11

Although this is a really old question with many answers, you can do this without external tools like sed or awk, and therefore in a platform-independent way. You can "simply" do it with gnuplot alone (even with the version current at the time of the OP's question: gnuplot 4.4.0, March 2010).

However, from your example data and description it is not clear whether the value of interest

1. is strictly in the 12th column, or
2. is always in the last column, or
3. could be in any column but is always followed by pts.

For all 3 cases there are gnuplot-only (hence platform-independent) solutions. The assumption is that the column separator is a space.

ad 1. The simplest solution: with u 1:12, gnuplot simply ignores non-numerical column values, and trailing non-numerical characters are dropped, e.g. 45pts will be interpreted as 45.

ad 2. and 3. If you extract the last column as a string, gnuplot will fail and stop when you try to convert a non-numerical value into a floating point number via real(). Hence, you have to test yourself, via your own function isNumber(), whether the column value at least starts with a number and can therefore be converted by real(). If the string is not a number, you could set the value to 1/0 or NaN. However, in earlier gnuplot versions this interrupts the line of a lines(points) plot. In newer gnuplot versions (>=4.6.0) you could set the value to NaN and avoid interruptions via set datafile missing NaN, but this is not available in gnuplot 4.4. Furthermore, in gnuplot 4.4 NaN is simply set to 0.0 (GPVAL_NAN = 0.0). You can work around this with a "trick", which is also used in the script below: for a non-numerical value, the previous valid point (t0, y0) is simply plotted again.

Data: SO7353702.dat

2010/01/12/ 12:00 some un related alapha 129495 and the interesting value 45pts
2010/01/12/ 15:00 some un related alapha 129495 and no interesting value
2010/01/13/ 09:00 some un related alapha 345678 and the interesting value 60pts
2010/01/15/ 09:00 some un related alapha 345678 62pts and nothing
2010/01/17/ 09:00 some un related alapha 345678 and nothing
2010/01/18/ 09:00 some un related alapha 345678 and the interesting value 70.5pts
2010/01/19/ 09:00 some un related alapha 345678 and the interesting value extra extra 64pts
2010/01/20/ 09:00 some un related alapha 345678 and the interesting value 0.66e2pts

Script: (works for gnuplot>=4.4.0, March 2010)

### extract numbers without external tools
reset
FILE = "SO7353702.dat"

set xdata time
set timefmt "%Y/%m/%d/ %H:%M"
set format x "%b %d"
isNumber(s) = strstrt('+-.',s[1:1])>0 && strstrt('0123456789',s[2:2])>0 \
              || strstrt('0123456789',s[1:1])>0

# Version 1:
plot FILE u 1:12 w lp pt 7 ti "value in the 12th column"
pause -1

# Version 2:
set datafile separator "\t"
getLastValue(col) = (s=word(strcol(col),words(strcol(col))), \
                     isNumber(s) ? (t0=t1, real(s)) :  (y0))
plot t0=NaN FILE u (t1=timecolumn(1), y0=getLastValue(1), t0) : (y0) w lp pt 7 \
        ti "value in the last column"
pause -1

# Version 3:
getPts(s) = (c=strstrt(s,"pts"), c>0 ? (r=s[1:c-1], p=word(r,words(r)), isNumber(p) ? \
            (t0=t1, real(p)) : y0) : y0)
plot t0=NaN FILE u (t1=timecolumn(1),y0=getPts(strcol(1)),t0):(y0) w lp pt 7 \
            ti "value anywhere with trailing 'pts'"
### end of script

Result: three plots, one each for Version 1, Version 2, and Version 3 (images not included).
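For comparison, case 3 (value in any column, always followed by pts) can also be handled in a preprocessing step with awk, which is what the question originally asked for. The following is only a minimal sketch and not part of the answer above: the function name extract_pts is hypothetical, and the number pattern (optional sign, decimals, optional exponent) is an assumption about what the log may contain.

```shell
# Hypothetical helper (not from the answer above): prints "date time value"
# for each line containing a field that is a number directly followed by "pts".
extract_pts() {
  awk '
    {
      # Fields 1 and 2 are the date and time, so start scanning at field 3.
      for (i = 3; i <= NF; i++) {
        if ($i ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?pts$/) {
          v = $i
          sub(/pts$/, "", v)   # strip the unit, keep the number as text
          print $1, $2, v
          break                # use the first matching field only
        }
      }
    }
  ' "$@"
}

# Example run on three lines taken from SO7353702.dat
extract_pts <<'EOF'
2010/01/12/ 15:00 some un related alapha 129495 and no interesting value
2010/01/15/ 09:00 some un related alapha 345678 62pts and nothing
2010/01/20/ 09:00 some un related alapha 345678 and the interesting value 0.66e2pts
EOF
```

The example run prints one line for each of the last two input lines, ending in 62 and 0.66e2 respectively, and skips the first. The value is passed through as text, so gnuplot does the numeric conversion itself. With a file, the call would be extract_pts SO7353702.dat.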

Chris · Answer 2 · 2011-09-09T04:01:25.713

1

Try:

awk 'NF==12{sub(/pts/,"",$12);printf "%s %s, %s\n", $1, $2, $12}' file

Input:

2010/01/12/ 12:00 some un related alapha 129495 and the interesting value 45pts
2010/01/12/ 15:00 some un related alapha 129495 and no interesting value
2010/01/13/ 09:00 some un related alapha 345678 and the interesting value 60pts

Output:

2010/01/12/ 12:00, 45
2010/01/13/ 09:00, 60

Updated for your new requirements:

Command:

awk 'NF==12{gsub(/\//,"-",$1);sub(/pts/,"",$12);printf "%s%s %s\n", $1, $2, $12}' file

Output:

2010-01-12-12:00 45
2010-01-13-09:00 60

HTH Chris

edited Sep 09 '11 at 04:01

answered Sep 08 '11 at 20:05


Sorry, I noticed that the line was not broken in the right place in my CSV file; I've amended it so that it is. How do I change the awk program above to print the correct CSV file? – chris Sep 08 '11 at 20:12
This awk 'NF==12{gsub(/\//,"-",$1)sub(/pts/,"",$12);printf "%s%s %s \n", $1, $2, $12}' file gives me "2010-01-12-12:00 45 2010-01-13-09:00 60" with line breaks (not shown here). – Chris Sep 08 '11 at 20:40
@Chris : it's probably better to edit your posted answer, and leave a comment to indicate the edit. Good luck to all! – shellter Sep 08 '11 at 21:10

score 1 · Answer 3 · answered Sep 09 '11 at 02:35

Bash

#!/bin/bash

while read -r a b line
do
  [[ $line =~ ([0-9]+)pts$ ]] && echo "$a $b, ${BASH_REMATCH[1]}"
done < file

bash-o-logist

carlpett · Answer 4 · 2011-09-08T20:26:07.137

0

It is indeed possible. A regex such as this one, for instance:

sed -n 's!([0-9]{4}/[0-9]{2}/[0-9]{2}/ [0-9]{2}:[0-9]{2}).*([0-9]+)pts!\1, \2!p'

edited Sep 08 '11 at 20:26

answered Sep 08 '11 at 19:59


When I execute the above I get the error: sed: 1: "s!([0-9]{4}/[0-9]{2}/[0 ...": \1 not defined in the RE. Can you explain what the command is trying to do? – chris Sep 08 '11 at 20:14
It collects the parts you are interested in: `([0-9]{4}/[0-9]{2}/[0-9]{2}/ [0-9]{2}:[0-9]{2})` is a regex matching your date string. Then `.*` throws away anything up until a number of digits followed by `pts`, and saves those digits. Then it prints those two groups. What version of `sed` are you using? – carlpett Sep 08 '11 at 20:19
using gnu sed 4.2.1 i get the error: sed: -e expression #1, char 71: invalid reference \2 on `s' command's RHS – chris Sep 08 '11 at 20:23
Interesting... I'm using 4.2.1 too, and it works for me. Did you copy the command or type it? It's quite easy to miss something when trying to type these long commands... – carlpett Sep 08 '11 at 20:27
I guess you could try adding a `-r`. I thought it would be needed, but my `sed` weirded out when I used it... – carlpett Sep 08 '11 at 20:33
1

Regular sed requires backslashes in front of `{}` and `()` as metacharacters in regular expressions. There is sometimes an option to change that so the code works as written. – Jonathan Leffler Sep 09 '11 at 03:56
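To illustrate the comment above, here is a sketch of the same substitution in both syntaxes. It is not carlpett's original command: the function names sed_ere and sed_bre are hypothetical, the pattern is anchored with ^ and $, and a space is required before the final number. That last change is an assumption about the log format, added because a greedy `.*` directly in front of `([0-9]+)` would leave only the last digit for the second group.

```shell
# ERE form: -E makes ( ) { } + special without backslashes.
# (GNU sed also accepts -r for the same purpose.)
sed_ere() {
  sed -n -E 's!^([0-9]{4}/[0-9]{2}/[0-9]{2}/ [0-9]{2}:[0-9]{2}).* ([0-9]+)pts$!\1, \2!p' "$@"
}

# BRE form: the same pattern with the metacharacters backslash-escaped,
# and \{1,\} in place of +, which basic regular expressions do not define.
sed_bre() {
  sed -n 's!^\([0-9]\{4\}/[0-9]\{2\}/[0-9]\{2\}/ [0-9]\{2\}:[0-9]\{2\}\).* \([0-9]\{1,\}\)pts$!\1, \2!p' "$@"
}

# Example run: only the line ending in <digits>pts is printed.
sed_ere <<'EOF'
2010/01/12/ 12:00 some un related alapha 129495 and the interesting value 45pts
2010/01/12/ 15:00 some un related alapha 129495 and no interesting value
EOF
```

Both functions read from the files given as arguments, or from standard input when called without any. Like the original, they handle whole-number values only.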

Kent · Answer 5 · 2011-09-08T21:09:36.310

0

awk '/pts/{ gsub(/pts/,"",$12);print $1,$2", "$12}' yourFile

output:

2010/01/12/ 12:00, 45
2010/01/13/ 09:00, 60

[Update: based on your new requirement]

How can i modify the above to look like:
2010-01-12-12:00 45 
2010-01-13-09:00 60

awk '/pts/{ gsub(/pts/,"",$12);a=$1$2OFS$12;gsub(/\//,"-",a);print a}' yourFile

the cmd above will give you:

2010-01-12-12:00 45
2010-01-13-09:00 60

edited Sep 08 '11 at 21:09

answered Sep 08 '11 at 20:15


thanks! just realised gnuplot expects the values to be separated by a space. How can i modify the above to look like: 2010-01-12-12:00 45 2010-01-13-09:00 60 thanks, i'm almost there! – chris Sep 08 '11 at 20:29
@norm, just change the print statement to: `print $1, $2, $12` -- remove the literal, quoted comma. – glenn jackman Sep 08 '11 at 20:38

score 0 · Answer 6 · answered Sep 09 '11 at 03:28

sed can be made more readable:

nn='[0-9]+'
n6='[0-9]{6}'
n4='[0-9]{4}'
n2='[0-9]{2}'
rx="^($n4/$n2/$n2/ $n2:$n2) .+ $n6 .+ ($nn)pts$"

sed -nre "s|$rx|\1 \2|p" file

output

2010/01/12/ 12:00 45
2010/01/13/ 09:00 60

answered Sep 09 '11 at 03:28

Peter.O

score 0 · Answer 7 · answered Sep 09 '11 at 04:50

I'd do that in two pipeline stages, first awk then sed:

awk '$NF ~ /[[:digit:]]+pts/ { print $1, $2", "$NF }' file |
  sed 's/pts$//'

By using $NF instead of a fixed number, you work with the final field, regardless of what the unrelated text looks like and how many fields it occupies.
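The same $NF idea also works in a single awk stage, by stripping the suffix from a copy of the last field. This is a minimal sketch rather than the answer's own command: the function name last_field_value is hypothetical, and the anchored pattern assumes whole-number values.

```shell
# Hypothetical single-stage variant of the pipeline above.
last_field_value() {
  awk '$NF ~ /^[0-9]+pts$/ {
         v = $NF
         sub(/pts$/, "", v)      # drop the unit from a copy of the last field
         print $1, $2 ", " v
       }' "$@"
}

# Example run: the second line is skipped because its last field is "nothing".
last_field_value <<'EOF'
2010/01/19/ 09:00 some un related alapha 345678 and the interesting value extra extra 64pts
2010/01/15/ 09:00 some un related alapha 345678 62pts and nothing
EOF
```

Working on the copy v leaves $NF untouched, so awk does not rebuild the input record.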
