I am learning Python web scraping, and my test script gives me what I need so far, but it returns only one record.

On inspection, the id I am scraping has some varying characters prepended to it, e.g.:

 id="List_1__firstName"

So I want to get records by matching only part of the id, something like %%_firstName.

How do I go about this? This is my current code:

import requests
from bs4 import BeautifulSoup

url = 'https://****.co**/'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
Name = soup.find(id='List_l0_l00_firstName').get_text()
print(Name)

1 Answer

Maybe the following question helps you.

Matching partial ids in BeautifulSoup

You can use find_all and pass a compiled regular expression as the id:

import re
soup.find_all(id=re.compile('_firstName$'))
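To make the one-liner above concrete, here is a minimal, self-contained sketch. The inline HTML (including the `_firstNameLabel` span) is made up for illustration, since the real page is not shown. It is meant to show two things: the trailing `$` restricts the match to ids that end in `_firstName`, and `get_text()` has to be called on each element of the result rather than on the result itself.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page, which is not shown.
html = """
<div id="List_1__firstName">foo</div>
<div id="List_2__firstName">bar</div>
<span id="List_3__firstNameLabel">not a name</span>
"""

soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup applies the pattern with search(), so no leading wildcard
# is needed. The '$' anchors the match at the end of the id, which skips
# the "..._firstNameLabel" span.
elements = soup.find_all(id=re.compile(r"_firstName$"))

# find_all returns a list-like ResultSet; get_text() belongs to each element.
names = [el.get_text(strip=True) for el in elements]
print(names)
```

Without the `$`, the pattern would also match the third element, which may or may not be what you want on the real page.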

Additional comment

Here are my local test script and its output, as mentioned in the comment below.

output

$ ls 
index.html  main.py

$ python3 main.py
foo
bar
baz

python script (main.py)

import bs4
import re

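# Parse the local test page; the with-block closes the file afterwards.
with open('index.html') as f:
    soup = bs4.BeautifulSoup(f, 'html.parser')

# Match every element whose id ends in "_firstName".
elements = soup.find_all(id=re.compile('_firstName$'))

for el in elements:
    print(el.get_text())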

html file (index.html)

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>test</title>
</head>
<body>
  <div id="List_1__firstName">foo</div>
  <div id="List_2__firstName">bar</div>
  <div id="List_3__firstName">baz</div>
</body>
</html>
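
As an aside, the same "id ends with" match can probably be written without `re` by using a CSS attribute selector through `select()`. This is only a sketch of an alternative, not what the answer above used; it assumes a Beautiful Soup version that ships with soupsieve (listed under "Requires" in the environment below), and it inlines the test markup so it runs on its own.

```python
from bs4 import BeautifulSoup

# Same three elements as index.html, inlined so the snippet is standalone.
html = """
<div id="List_1__firstName">foo</div>
<div id="List_2__firstName">bar</div>
<div id="List_3__firstName">baz</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [id$="..."] is the CSS "attribute value ends with" selector.
selected = [el.get_text() for el in soup.select('[id$="_firstName"]')]
print(selected)
```

Restricting by tag works the same way, e.g. `soup.select('div[id$="_firstName"]')`.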

my environment

$ python --version
Python 3.7.6

$ pip show beautifulsoup4
Name: beautifulsoup4
Version: 4.9.3
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
Author-email: leonardr@segfault.org
License: MIT
Location: /Users/ntb/envs/bs/lib/python3.7/site-packages
Requires: soupsieve
Required-by:
naru
  • Thanks for the pointer. I ended up using `Name = soup.findAll('span', id=re.compile('_firstName'))`. It's giving me all the matching span tags along with their data, but I am only interested in the text inside each span or div. Initially, with get_text(), I could only get a single record. – Bongani Napoleon Dlamini Nov 26 '20 at 00:27
  • I think you can iterate over the result and call `get_text()` on each element. – naru Nov 26 '20 at 03:51
  • It's giving me this error: AttributeError: 're.Pattern' object has no attribute 'get_text' – Bongani Napoleon Dlamini Nov 26 '20 at 11:05
  • That's strange... I actually tested it locally and it worked. I've just added my console output, Python script, and HTML file to my answer above. – naru Nov 27 '20 at 06:32
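
The code that raised the error in the comments was never posted, so the following is only a guess at the cause: the message is what Python 3.7+ reports when `get_text()` is called on the compiled pattern itself instead of on an element returned by `findAll`. The sketch below reproduces that message on made-up markup and then shows the per-element version; the HTML and variable names are illustrative, not taken from the asker's page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the asker's real page is not shown.
html = (
    '<span id="List_1__firstName">foo</span>'
    '<span id="List_2__firstName">bar</span>'
)
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"_firstName$")

# One way to trigger the reported error: calling get_text() on the pattern.
try:
    pattern.get_text()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# Working version: search with the pattern, then call get_text() per element.
names = [span.get_text() for span in soup.find_all("span", id=pattern)]
print(names)
```

If the misplaced call turns out not to be the cause, printing `type(x)` for whatever object `get_text()` is being called on should show where the pattern is slipping in.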