Limit Sed print section of file btw 2 regexp to first occurrence

Question

I am parsing text weather data from http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly and want to grab only the data for my county/area. The trick is that each text report also contains the previous reports from earlier in the day, and I'm only interested in the latest one, which appears towards the beginning of the file. I tried the "print section of file between two regular expressions (inclusive)" recipe from the sed one-liners, but I couldn't figure out how to get it to stop after the first occurrence.

sed -n '/OHZ061/,/OHZ062/p' /tmp/weather.html

I then found the question "Sed print between patterns the first match result", and its approach works with the following

sed -n '/OHZ061/,$p;/OHZ062/q' /tmp/weather.html

but I feel like it isn't the most robust of solutions. I don't have anything to back up the statement of robustness but I have a gut feeling that there might be a more robust solution.

So are there any better solutions out there? Also is it possible to get my first attempted solution to work? And if you post a solution please give an explanation of all the switches/backreference/magic as I'm still trying to discover all the power of sed and command line tools.

And to help start you off:

wget -q "http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly" -O /tmp/weather.html

PS: I looked at this post: http://www.unix.com/shell-programming-scripting/167069-solved-sed-awk-print-between-patterns-first-occurrence.html but the sed was completely Greek to me and I couldn't muddle through it to get it to work for my problem.

It is in fact a correct way to do what you want. The solution offered by your link is basically the same but tries to combine the whole section in one block before printing it. — aragaer, Mar 14 '13 at 15:50
"the sed was completely greek to me and I couldn't muddle through it to get it to work for my problem" - then why oh why would you want to learn more about it? Just use a different tool with a much clearer syntax (e.g. awk). — Ed Morton, Mar 15 '13 at 03:48

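To unpack the switches in the two sed commands from the question, here is a minimal sketch on a made-up sample file (the sample contents and the `first_range` variant are assumptions for illustration, not real NWS output). `-n` turns off sed's default printing so that only `p` prints. `/OHZ061/,/OHZ062/` is an address range that starts again whenever the start pattern matches later in the file, which is why the first attempt prints every report. `/OHZ061/,$` runs from the first match to the last line (`$`), and `q` quits as soon as its address matches, so nothing after the first OHZ062 line is read.

```shell
#!/bin/sh
# Hypothetical sample: two reports, newest first, each with an OHZ061 block.
sample=$(mktemp)
cat > "$sample" <<'EOF'
header
OHZ061 latest
DAYTON SUNNY 41
OHZ062 latest
older report
OHZ061 earlier
DAYTON CLOUDY 35
OHZ062 earlier
EOF

# First attempt: prints every OHZ061..OHZ062 range in the file.
all=$(sed -n '/OHZ061/,/OHZ062/p' "$sample")

# Version from the question: print from the first OHZ061 line onward,
# and quit at the first OHZ062 line.
first=$(sed -n '/OHZ061/,$p;/OHZ062/q' "$sample")

# Variant that keeps the original range: inside the range, print each
# line and quit once the end pattern has been printed.
first_range=$(sed -n '/OHZ061/,/OHZ062/{p;/OHZ062/q;}' "$sample")

printf '%s\n' "$first_range"
rm -f "$sample"
```

The braces group `p` and the conditional `q` under the range, so the first attempt works with one added quit. One difference that may matter for robustness: in the `,$p` form the `/OHZ062/q` test applies to every line, so an OHZ062 line that appears before the first OHZ061 line would make sed quit without printing anything, whereas the braced variant only quits inside the range.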
score 1 · Answer 1 · answered Mar 14 '13 at 22:01

This isn't sed, because I don't like parsing HTML with that tool, but here is a solution using Perl with the help of an HTML parser, HTML::TreeBuilder. The code is commented step by step, so I think it's easy to follow.

Content of script.pl:

#!/usr/bin/env perl

use warnings;
use strict;
use HTML::TreeBuilder;

##
## Get content of the web page.
##
open my $fh, '-|', 'wget -q -O- "http://www.nws.noaa.gov/view/prodsByState.php?state=OH&prodtype=hourly"' or die;

##
## Parse content into a tree structure.
##
my $tree = HTML::TreeBuilder->new;
$tree->parse_file( $fh ) || die;

## 
## Content is inside <pre>...</pre>, so search it in scalar context to get only
## the first one (the newest).
##
my $weather_data = $tree->find_by_tag_name( 'pre' )->as_text or die;

##
## Split data on "$$" and discard all tables of weather info but the first one.
##
my $last_weather_data = (split /(?m)^\$\$/, $weather_data, 2)[0];

## 
## Remove everything before the last "OHZ + three digits" zone code in the text.
##
$last_weather_data =~ s/\A.*(OHZ\d{3}.*)\z/$1/s;

## 
## Print result.
##
printf qq|%s\n|, $last_weather_data;

Run it like:

perl script.pl

And at 23:00 of 14-March-2013 it yields:

OHZ001>008-015>018-024>027-034-035-043-044-142300-
   NORTHWEST OHIO

CITY           SKY/WX    TMP DP  RH WIND       PRES   REMARKS
DEFIANCE       MOSUNNY   41  18  39 W7G17     30.17F
FINDLAY        SUNNY     39  21  48 W13       30.17F
TOLEDO EXPRESS SUNNY     41  19  41 W14       30.16F
TOLEDO METCALF MOSUNNY   42  21  43 W9        30.17S
LIMA           MOSUNNY   38  22  52 W12       30.18S

Beautiful. Thanks much for the detailed answer. I'll have to look into Perl's power as well as awk. — N Klosterman, Mar 18 '13 at 14:07

score 1 · Accepted Answer · answered Mar 15 '13 at 03:46

sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk:

awk '/OHZ061/{found=1} found{print; if(/OHZ062/) exit}' /tmp/weather.html
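
For anyone who wants the one-liner spelled out, here is a sketch of the same logic on a made-up sample file (the sample contents are hypothetical, not real NWS output). `/OHZ061/{found=1}` sets a flag the first time the start pattern is seen; `found{...}` runs for every line while that flag is set; `print` emits the line; and `exit` stops awk right after the first OHZ062 line is printed, so the older reports further down are never processed.

```shell
#!/bin/sh
# Hypothetical sample: two reports, newest first, each with an OHZ061 block.
sample=$(mktemp)
cat > "$sample" <<'EOF'
header
OHZ061 latest
DAYTON SUNNY 41
OHZ062 latest
older report
OHZ061 earlier
DAYTON CLOUDY 35
OHZ062 earlier
EOF

result=$(awk '
  /OHZ061/ { found = 1 }   # start pattern seen: switch printing on
  found {                  # runs for every line while the flag is set
    print
    if (/OHZ062/) exit     # end pattern: stop after printing this line
  }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because the end pattern is only tested while `found` is set, an OHZ062 line that happened to appear before the first OHZ061 line would simply be skipped.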

Ed Morton


This makes sense. In retrospect I didn't need to do any substitutions, but I was so in the habit of using sed that I was trying to force it to do work it really wasn't intended to do. Thanks for the insight. I'll have to pick up awk and expand my horizons. – N Klosterman Mar 18 '13 at 14:06
