Can I use sed to replace everything before the last instance of an html tag?

Question

I'm still pretty new to web-scraping, so apologies if this question is a bit simple.

I'm using the command line to scrape a news website for a list of URLs nested in h2 tags. I can mostly isolate the list I want with grep.

curl https://myurl.com | grep 'h2'

However, above my list of URLs there is a block of unrelated links with the same class and heading tag, and I want to remove these. It's clear where my desired list begins, because this line of HTML consistently appears directly above it.

<span class="cb-date"><time class="entry-date updated" datetime="2020-11-09">November 13, 2020</time></span></div></div></li></ul></div></div></li>
            #my list begins
            <h2 class="cb-post-title"><a href="https://www.myurl.com">Article Heading Here</a></h2>

I was hoping to use sed to remove everything before the last occurrence of the tag </time> or perhaps </li> and then pipe it. I tried something like this, but keep getting "failure writing output to destination."

curl https://myurl.com | grep 'h2' | sed -n '/</time>/h;//!H;$!d;x;//p'

Not really sure how to proceed, so any help would be truly appreciated. Please do let me know if there is a way I should be doing this other than sed. Cheers.

See https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Diego Torres Milano, Nov 16 '20 at 03:10
`sed` is not the best tool to parse HTML or XML. It may work in special cases, but not in general. — U. Windl, Nov 16 '20 at 07:34

score 0 · Answer 1 · answered Nov 16 '20 at 11:19

Use https://github.com/ericchiang/pup instead. Something like this should work:

curl ... | pup 'h2.cb-post-title'
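
If you'd rather stick with standard tools, here is a rough sketch of the "drop everything up to the last marker" idea, using awk instead of sed. Two things in the original pipeline get in the way: the / inside </time> ends the /.../ regex early unless it is escaped, and running grep 'h2' first discards the </time> line before sed ever sees it. When sed exits on the syntax error, curl loses its reader, which is probably where "failure writing output to destination" comes from. The function names below are made up for illustration, and the sketch assumes the marker and each h2 link sit on their own lines, as in your sample.

```shell
# Hypothetical helper: print only the lines that follow the LAST line
# containing </time>. The buffer is reset every time the marker is seen,
# so at end of input it holds just the text after the final marker.
# If the marker never appears, the input passes through unchanged.
after_last_time() {
  awk '/<\/time>/ { buf = ""; next }
       { buf = buf $0 "\n" }
       END { printf "%s", buf }'
}

# Hypothetical helper: keep the cb-post-title headings and print the
# href value of the link on each of those lines (assumes one link per line).
extract_h2_links() {
  grep '<h2 class="cb-post-title"' |
    sed -n 's/.*<a href="\([^"]*\)".*/\1/p'
}

# Intended usage (note that the h2 filter comes AFTER the trimming step):
#   curl -s https://myurl.com | after_last_time | extract_h2_links
```

Against the sample in the question this should print only https://www.myurl.com. It is line-based, so it will break if the site changes how its markup is laid out; pup is the more robust option.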

HappyFace
