I am trying to scrape text from paragraphs that have different id values. The HTML looks as follows:

<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>I realized how much Abilify has been helping me when I recently 
tried to taper off of it. I am on the bipolar spectrum, with mainly 
depression and some OCD symptoms. My obsessive, intrusive thoughts came 
racing back when I decreased the medication. I also got much more tired and 
had insomnia with the decrease. am not happy with side effects of 15 lb 
weight gain, increased cholesterol and a flat effect on my emotions. I am 
actually wondering if an increase from the 7 mg would help even more...for 
now I&#39;m living with the side effects.<br><a 
onclick="toggle('comTrunc1'); toggle('comFull1');return false;" 
href="#">Hide Full Comment</a></p>

<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>It&#39;s worked Very well for me. I&#39;m sleeping I&#39;m 
eating I&#39;m going Out in the public. Overall I&#39;m very 
satisfied.However I haven&#39;t heard anybody mention this but my feet are 
very puffy and swollen is this a side effect does anyone know?<br><a 
onclick="toggle('comTrunc2'); toggle('comFull2');return false;" 
href="#">Hide Full Comment</a></p>

......

I am able to scrape the text for one particular id, but not for all ids at once. Can anyone help me scrape the text from all of these ids? My code looks like this:

>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required2 = soup.find("p", {"id": "comFull1"}).text
>>> required2
"Comment:I realized how much Abilify has been helping me when I recently 
tried to taper off of it. I am on the bipolar spectrum, with mainly 
depression and some OCD symptoms. My obsessive, intrusive thoughts came 
racing back when I decreased the medication. I also got much more tired and 
had insomnia with the decrease. am not happy with side effects of 15 lb 
weight gain, increased cholesterol and a flat effect on my emotions. I am 
actually wondering if an increase from the 7 mg would help even more...for 
now I'm living with the side effects.Hide Full Comment"
SIM
Ashok Kumar Jayaraman

3 Answers

1

As I understand it, you want to scrape the text of all paragraphs (<p> tags) in a webpage.

The function you are looking for is:

soup.find_all('p')

A more comprehensive example is shown in the Beautiful Soup documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
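
As a rough sketch of how this plays out: find_all('p') returns every paragraph on the page, so it usually needs narrowing. The example below assumes the comments all carry class="comment", as in the question's snippet; the sample markup and the "intro" paragraph are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question's snippet.
html = """
<p class="intro">Unrelated paragraph.</p>
<p id="comFull1" class="comment" style="display:none">First review.</p>
<p id="comFull2" class="comment" style="display:none">Second review.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all('p') returns every <p> tag, including ones that are not comments.
all_paragraphs = soup.find_all("p")

# Passing class_ narrows the result to paragraphs with class="comment".
comment_paragraphs = soup.find_all("p", class_="comment")

for p in comment_paragraphs:
    print(p["id"], "->", p.get_text())
```

If other paragraphs on the real page share the same class, filtering on the id is likely the safer choice.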

1

If you want to use XPath (for example, on a Scrapy response), you can use:

response.xpath("//p[contains(@id,'comFull')]/text()").extract()

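But since you are using Beautiful Soup, you can pass a function or a compiled regular expression to the find_all method, as described in the question "Matching id's in BeautifulSoup". The ids in your HTML are comFull1, comFull2, and so on, with no hyphen, so the pattern only needs to match the comFull prefix (this requires import re):

soup.find_all('p', id=re.compile('^comFull'))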
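
To illustrate the regular-expression approach, here is a minimal sketch on made-up markup. It assumes the truncated comments (the comTrunc ids referenced in the question's onclick handlers) are also <p> tags, which the question does not show; the sample text is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: truncated and full versions of each comment.
html = """
<p id="comTrunc1" class="comment">First rev...</p>
<p id="comFull1" class="comment" style="display:none">First review.</p>
<p id="comTrunc2" class="comment">Second rev...</p>
<p id="comFull2" class="comment" style="display:none">Second review.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Match ids that are exactly "comFull" followed by one or more digits.
full_id = re.compile(r"^comFull\d+$")

# find_all tests each <p> tag's id attribute against the pattern.
full_comments = soup.find_all("p", id=full_id)
texts = [p.get_text() for p in full_comments]
print(texts)
```

Requiring digits after the prefix keeps the match from picking up any other id that merely starts with comFull.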
Gaur93
1

Try this. If the ids of all the paragraphs share the same prefix and differ only in a numeric suffix (comFull1, comFull2, comFull3, etc.), the selector below should handle them.

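from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'  # placeholder URL from the question
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read()

soup = BeautifulSoup(content, "html.parser")
# Select every element whose id attribute starts with "comFull"
for item in soup.select("[id^='comFull']"):
    print(item.text)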
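
The output in the question also includes the "Comment:" label and the "Hide Full Comment" link text. If only the comment body is wanted, one possible approach is to drop those child tags before reading the text. This is a sketch on hypothetical sample markup; comment_text is an illustrative helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup following the structure shown in the question.
html = """
<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>First review.<br><a href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>Second review.<br><a href="#">Hide Full Comment</a></p>
"""

soup = BeautifulSoup(html, "html.parser")


def comment_text(tag):
    """Return the comment body without the label and the toggle link."""
    # decompose() removes the tags from the tree, so this modifies the soup.
    for extra in tag.find_all(["strong", "a"]):
        extra.decompose()
    # Join the remaining text fragments and trim surrounding whitespace.
    return tag.get_text(" ", strip=True)


comments = [comment_text(p) for p in soup.select("p[id^='comFull']")]
print(comments)
```

Because decompose() changes the parsed tree, parse the page again if the label or link is needed later.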
SIM