I want to scrape all the text (headings, bullets, paragraphs) from an article, except some
tags at the start and at the end of the article

Question

I want to scrape the article text from these pages:

https://www.traveloffpath.com/covid-19-travel-insurance-everything-you-need-to-know/ and https://www.traveloffpath.com/what-to-do-if-your-flight-is-delayed-or-canceled/?swcfpc=1

I am stuck on the "p" tags. I don't want the "p" tags at the start and at the end of the article: the "Share the article" and "Last updated" paragraphs at the top, and some "p" tags at the bottom whose text is not part of the article.

Articletext = soup.find(class_="article")
for items in soup.find_all(class_="article"):
    Gather = '\n'.join([item.text for item in items.find_all(["h6","h5","h4","h3","h2","h1","p","li"])])
    filtered = Gather.split("↓ Join the community ↓")
    Content = filtered[0].split("Email")
    while True :
        try:
            Content = filtered[0].split("Email")
            
        except :
            Content = Content[1].split("ago")
        else :
            break
    # try:
    #     Content = filtered[0].split("Email")
    # except:
    #     Content = filtered[0].split("ago")
    # Content = re.split('ago | Read More:',Gather) 
    print("Content: ", Content[1])

Use a DOM parser to parse the dom and then inspect it. Don't reinvent the wheel. — psykx, Oct 04 '22 at 13:15
You can always get a list of all `<p>` tags and later slice that list with `[1:-1]` to drop the first and the last one. — furas, Oct 04 '22 at 13:57
By *'some "p" tag from the bottom text that is not included in the article'* do you mean the list of links for recommended further reading and the *"This article originally appeared...."* bit? — Driftr95, Oct 04 '22 at 17:53
I have found the answer myself for the links I mentioned above:
Articletext = soup.find(class_="article")
for items in soup.find_all(class_="article"):
    Gather = '\n'.join([item.text for item in items.find_all(["h6","h5","h4","h3","h2","h1","p","li"])])
    filtered = Gather.split("↓ Join the community ↓")
    Content = filtered[0].split("Email")
    word = "ago"
    if word in Content[1]:
        Content = Content[1].split("ago")
    print("Content: ", Content[1])
— Info Rewind, Oct 04 '22 at 18:37

score 1 · Accepted Answer · answered Oct 04 '22 at 18:57

You could filter within the list comprehension and then find where to slice off the unwanted parts at the end:

for items in soup.select('article.article'):
  # keep every heading; keep p/li only when they carry no class or id
  tags = [
      t for t in items.find_all(["h6","h5","h4","h3","h2","h1","p","li"])
      if not (t.name in ['p', 'li'] and (
          ('class' in t.attrs and t.attrs['class']) or
          ('id' in t.attrs and t.attrs['id'])
      ))
  ]  # filtered out "Share..." and "Last Updated..."
  tLen = len(tags)
  for i in list(range(tLen))[::-1]:  # counting down from last tag
    if tags[i].name == 'h3':
      tags = tags[:i]  # drop the last h3 and everything after it
      break

  articleText = '\n'.join([t.text for t in tags])
  print(articleText)

With that, you'll be able to get rid of the paragraph with the list of links for further reading. If you want everything up to just before the "↓ Join the community ↓" part, as in your code, check for `h5` instead of `h3` (`if tags[i].name == 'h5':`). If you want everything through to the end, skipping only the "subscribe..." section, change that `if` block to:

if tags[i].name == 'h5':
    tags = tags[:i] + tags[i+1:]
    break
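
To make the three variants above easier to compare, here is a minimal sketch that pulls the filter and the "cut at the last marker" step into two small functions. It is not part of the original answer: `keep`, `trim_at_last`, and the `drop_only_marker` flag are hypothetical names, the sample data is made up, and `Tag` is a plain stand-in for a BeautifulSoup tag so the snippet runs without fetching the page. The same functions should work on real `bs4` tags, since they only rely on `.name`, `.attrs` and `.text`.

```python
from collections import namedtuple

# Hypothetical stand-in for a bs4 tag; only the three attributes used
# by the answer's code (.name, .attrs, .text) are modelled.
Tag = namedtuple("Tag", ["name", "attrs", "text"])


def keep(tag):
    """Keep headings always; keep p/li only if they have no class or id."""
    if tag.name not in ("p", "li"):
        return True
    return not (tag.attrs.get("class") or tag.attrs.get("id"))


def trim_at_last(tags, marker, drop_only_marker=False):
    """Cut the list at the last tag named `marker`.

    By default the marker and everything after it are dropped; with
    drop_only_marker=True only the marker tag itself is removed.
    """
    for i in reversed(range(len(tags))):
        if tags[i].name == marker:
            return tags[:i] + (tags[i + 1:] if drop_only_marker else [])
    return tags  # marker not found: leave the list unchanged


# Made-up sample that mimics the page layout described in the question.
sample = [
    Tag("h1", {}, "Title"),
    Tag("p", {"class": ["share"]}, "Share the article"),
    Tag("p", {"id": "updated"}, "Last updated"),
    Tag("p", {}, "Body paragraph"),
    Tag("li", {}, "Bullet"),
    Tag("h3", {}, "Read More:"),
    Tag("p", {}, "Link list"),
    Tag("h5", {}, "↓ Join the community ↓"),
    Tag("p", {}, "Footer text"),
]

tags = [t for t in sample if keep(t)]
print("\n".join(t.text for t in trim_at_last(tags, "h3")))
```

Calling `trim_at_last(tags, "h5")` corresponds to the second variant, and `trim_at_last(tags, "h5", drop_only_marker=True)` to the third. Returning the list unchanged when the marker is missing is a design choice of this sketch; the loop in the answer behaves the same way because it simply never reaches `break`.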
