How to extract data from HTML using Beautiful Soup

Question

I am trying to scrape a web page and store the results in a CSV/Excel file. I am using Beautiful Soup for this.

I am trying to extract the data from the soup using the find_all function, but I am not sure how to capture the data in the name or title field.

The HTML file has the following format

<h3 class="font20">
 <span itemprop="position">36.</span> 
 <a class="font20 c_name_head weight700 detail_page" 
 href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank" 
 title="Nimblechapps Pvt. Ltd."> 
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>

This is my code so far. I am not sure how to proceed from here.

from bs4 import BeautifulSoup as BS
import requests 
page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all(class_ = 'font20 c_name_head weight700 detail_page')
names = cont.find_all('a', attrs={'class': 'font20 c_name_head weight700 detail_page'})

I have tried using the following -

Input: cont.h3.a.span
Output: <span itemprop="name">Nimblechapps Pvt. Ltd.</span>

I want to extract the name of the company - "Nimblechapps Pvt. Ltd."

Post the code you have tried, and what the specific problem with it is. — Scott Hunter, Dec 25 '18 at 18:32
@ScottHunter done! Please check the edited version of the question — Keshav c, Dec 25 '18 at 19:19
To get tag attributes use `tag[attr]`, to get tag text use `tag.text`. Note that `.find_all()` returns a list of elements. If you want only the first use `.find()` or select by index. — t.m.adam, Dec 25 '18 at 19:22
@t.m.adam I want to return the list of company names on the webpage. How would you suggest I do this? — Keshav c, Dec 25 '18 at 19:28
@drec4s yes I want cont.h3.a.span.text. But I need it for all the listing provided on the webpage! I am unable to return the list — Keshav c, Dec 25 '18 at 19:29
Easy, select the text of each element, eg: `for tag in cont.find_all("span", itemprop="name"): print(tag.text)` — t.m.adam, Dec 25 '18 at 19:37
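
The two suggestions from the comments above (tag[attr] for attributes, tag.text for text) can be sketched against the snippet from the question. This is only a minimal illustration on that one snippet, not the live page, whose markup may differ; get_text(strip=True) is used here instead of .text simply to trim the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
html = """
<h3 class="font20">
 <span itemprop="position">36.</span>
 <a class="font20 c_name_head weight700 detail_page"
 href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
 title="Nimblechapps Pvt. Ltd.">
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute access: a["title"] reads the title attribute of each matching <a> tag.
titles = [a["title"] for a in soup.find_all("a", class_="detail_page")]

# Text access: get_text(strip=True) returns the tag's text without padding.
names = [span.get_text(strip=True)
         for span in soup.find_all("span", itemprop="name")]

print(titles)
print(names)
```

Both lists hold one entry per listing, so on a full page they would contain every company name rather than only the first.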

score 2 · Accepted Answer · answered Dec 25 '18 at 19:38

You can use a list comprehension for that:

from bs4 import BeautifulSoup as BS
import requests

page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all('a', attrs={'class': 'font20 c_name_head weight700 detail_page'})
# get_text(strip=True) trims the whitespace around each company name
print([n.get_text(strip=True) for n in names])

You will get:

['Nimblechapps Pvt. Ltd.', (..) , 'InnoApps Technologies Pvt. Ltd', 'Umbrella IT', 'iQlance Solutions', 'getyoteam', 'JetRuby Agency LTD.', 'ONLINICO', 'Dedicated Developers', 'Appingine', 'webnexs']
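
Since the question also asks about storing the results in a CSV file, here is one possible sketch using only the standard csv module. The helper name write_names_csv, the "company" header, and the sample list are assumptions made for illustration; in practice the list would come from the comprehension above.

```python
import csv
import io


def write_names_csv(names, out):
    """Write a header row, then one company name per row, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["company"])
    for name in names:
        # Strip stray whitespace left over from the HTML.
        writer.writerow([name.strip()])


# Illustrative input only; replace with the scraped list of names.
sample = ["Nimblechapps Pvt. Ltd. ", "Umbrella IT", "iQlance Solutions"]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_names_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function could be called with a file opened as open("companies.csv", "w", newline="", encoding="utf-8"), where the file name is again just a placeholder.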

score 1 · Answer 2 · answered Dec 25 '18 at 20:01

The same thing, but using the descendant combinator (a space) to combine the type selector a with the attribute selector [itemprop="name"]:

names = [item.text for item in cont.select('a [itemprop="name"]')]
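
To see what the descendant combinator buys you, here is a small sketch on a trimmed version of the markup from the question (the class list is shortened for brevity, and the live page may differ). It compares the selector above with a plain span selector.

```python
from bs4 import BeautifulSoup

# Trimmed version of the question's markup, used only for illustration.
html = """
<h3 class="font20">
 <span itemprop="position">36.</span>
 <a class="detail_page" href="/companies/view/1033/nimblechapps-pvt-ltd"
    title="Nimblechapps Pvt. Ltd.">
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
 </a>
</h3>
"""

cont = BeautifulSoup(html, "html.parser")

# 'a [itemprop="name"]' matches only name elements nested inside an <a> tag.
inside_links = [t.get_text(strip=True)
                for t in cont.select('a [itemprop="name"]')]

# A bare 'span' selector also picks up the position span outside the link.
all_spans = [t.get_text(strip=True) for t in cont.select("span")]

print(inside_links)
print(all_spans)
```

The first list contains only the company name, while the second also includes the "36." position marker.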

QHarr


score 1 · Answer 3 · answered Dec 25 '18 at 20:01

Try not to use compound classes in the script, as they are prone to break: matching on the full class string fails as soon as the site adds, removes, or reorders a class. The following script should fetch the required content as well.

import requests
from bs4 import BeautifulSoup

link = "https://www.goodfirms.co/directory/platform/app-development/iphone?page=2"

res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.find_all(class_="commoncompanydetail"):
    link_tag = item.find(class_='detail_page')
    if link_tag is not None:  # skip blocks that have no company link
        print(link_tag.get_text(strip=True))
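
A short sketch of why compound classes are fragile, using two made-up links whose class attributes contain the same classes in a different order. The markup and company names here are invented for illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same four classes, different order on the second link.
html = """
<a class="font20 c_name_head weight700 detail_page">First Co</a>
<a class="detail_page weight700 c_name_head font20">Second Co</a>
"""

soup = BeautifulSoup(html, "html.parser")

# A multi-class string is compared against the exact class attribute value,
# so only the link with the identical ordering matches.
exact = soup.find_all("a", class_="font20 c_name_head weight700 detail_page")

# A single class matches any tag that carries it, regardless of the others.
single = soup.find_all("a", class_="detail_page")

print([a.text for a in exact])
print([a.text for a in single])
```

The compound search finds only the first link, whereas the single-class search finds both, which is why a single stable class is usually the safer choice.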
