I'm still pretty new to web scraping, so apologies if this question is a bit simple.
I'm using the command line to scrape a news website for a list of URLs nested in h2 tags. I can mostly isolate the list I want with grep:
curl https://myurl.com | grep 'h2'
However, above my list of URLs there is a block of unrelated links with the same class and heading tag, and I want to remove these. It's clear where my desired list begins, because this line of HTML consistently appears directly above it:
<span class="cb-date"><time class="entry-date updated" datetime="2020-11-09">November 13, 2020</time></span></div></div></li></ul></div></div></li>
#my list begins
<h2 class="cb-post-title"><a href="https://www.myurl.com">Article Heading Here</a></h2>
I was hoping to use sed to remove everything before the last occurrence of the tag </time> (or perhaps </li>) and then pipe the result onward. I tried something like this, but I keep getting "failure writing output to destination":
curl https://myurl.com | grep 'h2' | sed -n '/</time>/h;//!H;$!d;x;//p'
Not really sure how to proceed, so any help would be truly appreciated. Please do let me know if there is a way I should be doing this other than sed. Cheers.
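
For reference, here is a minimal sketch of the "keep only what follows the last </time>" idea. It swaps awk in for sed, purely because the slash in </time> is easier to escape there. Two guesses about the attempt above, neither confirmed: the unescaped / in /</time>/ probably ends the sed address early, so sed exits with a syntax error and curl can no longer write to the pipe; and running grep 'h2' first would drop the marker line before sed ever sees it, since that line contains no h2. The function name after_last_marker is hypothetical, and the sketch assumes the marker sits on its own line, as in the sample above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): read HTML on stdin and
# print only the lines that come after the LAST line containing </time>.
# Assumption: the marker is on its own line, so dropping that line is fine.
after_last_marker() {
  awk '
    /<\/time>/ { buf = ""; next }   # marker seen: discard everything so far
    { buf = buf $0 "\n" }           # otherwise keep accumulating lines
    END { printf "%s", buf }        # print whatever followed the last marker
  '
}

# Intended use, with the filter placed BEFORE grep so the marker line
# still exists when it is searched for (-s silences curl's progress meter):
#   curl -s https://myurl.com | after_last_marker | grep 'h2'
```

If no line contains </time>, the whole input passes through unchanged, so the grep step still runs on the full page.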