I'm still pretty new to web-scraping, so apologies if this question is a bit simple.

I'm working from the command line, trying to scrape a news website for a list of URLs nested in h2 tags. I can mostly isolate the list I want with grep.

curl https://myurl.com | grep 'h2'

However, above my list of URLs there is a block of unrelated links that have the same class and heading, and I want to remove these. It's clear in the markup where my desired list begins, because this line of HTML consistently appears directly above it.

<span class="cb-date"><time class="entry-date updated" datetime="2020-11-09">November 13, 2020</time></span></div></div></li></ul></div></div></li>
            #my list begins
            <h2 class="cb-post-title"><a href="https://www.myurl.com">Article Heading Here</a></h2>

I was hoping to use sed to remove everything before the last occurrence of the tag </time> (or perhaps </li>) and then pipe the result onward. I tried something like this, but I keep getting "failure writing output to destination."

curl https://myurl.com | grep 'h2' | sed -n '/</time>/h;//!H;$!d;x;//p'

Not really sure how to proceed, so any help would be truly appreciated. Please do let me know if there is a way I should be doing this other than sed. Cheers.

1 Answer


Use pup (https://github.com/ericchiang/pup) instead. Something like this should work. Note that cb-post-title is a class, so the selector uses a dot rather than #:

curl ... | pup 'h2.cb-post-title'
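
If installing pup is not an option, the "keep only what follows the last </time>" idea from the question can also be sketched with awk. Two things likely get in the way of the original pipeline: grep 'h2' runs first and drops the </time> marker line before sed ever sees it, and the / inside </time> is not escaped in the sed address. The sketch below is only an illustration under some assumptions: the marker and each h2 sit on separate lines, as in the snippet from the question; the function name links_after_last_time and the sample markup (with example.com URLs) are made up for the demo.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print only the <h2 lines that
# come after the LAST line containing </time>.
links_after_last_time() {
  awk '
    /<\/time>/ { n = 0; next }     # marker seen: forget every h2 buffered so far
    /<h2/      { buf[++n] = $0 }   # buffer candidate h2 lines
    END        { for (i = 1; i <= n; i++) print buf[i] }
  '
}

# Made-up sample input standing in for the real page.
sample='<h2 class="cb-post-title"><a href="https://example.com/unrelated">Unrelated</a></h2>
<span class="cb-date"><time datetime="2020-11-09">November 13, 2020</time></span>
<h2 class="cb-post-title"><a href="https://example.com/a">Article A</a></h2>
<h2 class="cb-post-title"><a href="https://example.com/b">Article B</a></h2>'

# Filter first, then pull the href value out of each remaining line.
result=$(printf '%s\n' "$sample" \
  | links_after_last_time \
  | sed -n 's/.*href="\([^"]*\)".*/\1/p')

printf '%s\n' "$result"
```

Against the real site, the same function would sit directly after curl, for example curl -s https://myurl.com | links_after_last_time, where -s silences curl's progress meter. Line-based matching like this is fragile if the site ever changes its markup, which is the main argument for an HTML-aware tool such as pup.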
HappyFace