Trying to extract some data from a webpage (scraping beginner)

Question

I'm trying to extract some data from a webpage using Requests and then BeautifulSoup. I started by fetching the HTML with Requests and then passing it to BeautifulSoup:

from bs4 import BeautifulSoup
import requests


result = requests.get("https://XXXXX")
#print(result.status_code)
#print(result.headers)
src = result.content
soup = BeautifulSoup(src, 'lxml')

Then I singled out some pieces of code:

tags = soup.findAll('ol',{'class':'activity-popup-users'})

print(tags)

Here is a part of what I got:

<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">

What I want now is to extract the value of data-user-id, i.e. the digits between the quotes. I would then like to save that data in some kind of spreadsheet. I am an absolute beginner and I'm mostly pasting code I found in tutorials or documentation. Thanks a lot for your time...

EDIT: So here's what I tried:

from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags['data-user-id'])

And here's what I got:

TypeError: list indices must be integers or slices, not str

So I tried this:

from bs4 import BeautifulSoup 
import requests 
result = requests.get("https://XXXX") 
src = result.content
soup = BeautifulSoup(src, 'html.parser')
#tags = soup.findAll('a',{'class':'account-group js-user-profile-link'}) 
tags = soup.findAll('ol',{'class':'activity-popup-users'}) 
tags.attrs
#print(tags['data-user-id'])

And got:

File "C:\Users\XXXX\element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key

AttributeError: ResultSet object has no attribute 'attrs'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Bitto · Accepted Answer · 2019-02-20T20:30:10.853


You can get the value of any attribute of a tag by treating the tag like a dictionary that maps attribute names to values.

See the section on attributes in the BeautifulSoup documentation.

tag['data-user-id']

For example:

html="""
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
tag=soup.find('div')
print(tag['data-user-id'])

Output

3787869561
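The two errors in the question both come from the difference between find() and find_all(): the first returns a single tag (or None), while the second returns a list-like ResultSet that has to be indexed or looped over. Below is a minimal sketch of that difference. The HTML snippet is made up for illustration and is not the real page; it also shows tag.get(), which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the real page.
html = """
<div class="account" data-user-id="111">first</div>
<div class="account" data-user-id="222">second</div>
<div class="other">no id here</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, so dictionary-style access works.
first = soup.find("div")
first_id = first["data-user-id"]

# find_all() returns a ResultSet (a list of Tags): loop over it.
all_divs = soup.find_all("div")

# .get() returns None when the attribute is missing instead of raising KeyError.
ids = [div.get("data-user-id") for div in all_divs]

# Keep only the tags that actually carry the attribute.
present = [div["data-user-id"] for div in all_divs if div.has_attr("data-user-id")]

print(first_id)
print(ids)
print(present)
```

Calling tags['data-user-id'] or tags.attrs directly on the result of find_all() fails because those operations belong to a single tag, not to the list of tags.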

Edit to include OP's question change:

from bs4 import BeautifulSoup
import requests
result = requests.get("http://twitter.com/RussiaUN/media")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div',class_='account')
# Just print each ID
for div in divs:
    print(div['data-user-id'])
# Write each ID to a file, one per line
with open('file.txt', 'w') as f:
    for div in divs:
        f.write(div['data-user-id'] + '\n')

Output:

edited Feb 20 '19 at 20:30

answered Feb 20 '19 at 18:09


Thank you! I did that: `from bs4 import BeautifulSoup import requests result = requests.get("https://twitter.com/RussiaUN/media") #print(result.status_code) #print(result.headers) src = result.content soup = BeautifulSoup(src, 'html.parser') #tags = soup.findAll('a',{'class':'account-group js-user-profile-link'}) #tags = soup.findAll('ol',{'class':'activity-popup-users'}) tag = soup.find('div') print(tag['data-user-id']) #print(tags.attrs)` And it says that: `KeyError: 'data-user-id'` – Max Baldwin Feb 20 '19 at 18:28
@MaxBaldwin are you trying to get all the data-user-id from all divs with account class? – Bitto Feb 20 '19 at 18:56
Yes, I want to extract all the data-user-id from the document. – Max Baldwin Feb 20 '19 at 19:00
@MaxBaldwin See my edit. findAll returns a list of tags/elements, not a single tag. You have to loop through it to get a single tag and then do tag['data-user-id'] – Bitto Feb 20 '19 at 19:01
last question: Why did we have to put an _ between find and all and between class and ='account' ? – Max Baldwin Feb 20 '19 at 19:19
@MaxBaldwin Because `class` is a reserved keyword in Python, bs4 uses `class_` as the argument name to avoid the conflict. – Bitto Feb 20 '19 at 19:25
And how can you store the resulting data in a file? Is it possible? – Max Baldwin Feb 20 '19 at 20:10
Thanks a lot! Could you recommend a website or pdf to learn a bit by myself? – Max Baldwin Feb 20 '19 at 20:33
@MaxBaldwin https://docs.python.org/3/tutorial/index.html , https://www.crummy.com/software/BeautifulSoup/bs4/doc/, http://docs.python-requests.org/en/master/user/quickstart/ – Bitto Feb 20 '19 at 20:39
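The question also asked for the IDs to end up in some kind of spreadsheet. One possible approach, sketched here under assumptions, is to write them to a CSV file with Python's standard csv module. The HTML snippet, the file name user_ids.csv, and the column names are hypothetical and chosen only for illustration; in practice the soup would come from the downloaded page as in the answer.

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = """
<div class="account js-actionable-user" data-screen-name="TheUNTimes" data-user-id="3787869561"></div>
<div class="account js-actionable-user" data-screen-name="Example" data-user-id="42"></div>
<div class="footer">not an account</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) is used because class is reserved in Python.
# It matches any tag whose class list contains "account".
divs = soup.find_all("div", class_="account")

# Collect (screen name, user id) pairs, skipping tags without a data-user-id.
rows = [
    (div.get("data-screen-name", ""), div["data-user-id"])
    for div in divs
    if div.has_attr("data-user-id")
]

# newline="" is what the csv module expects when writing files.
with open("user_ids.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["screen_name", "user_id"])  # header row (assumed column names)
    writer.writerows(rows)
```

The resulting file can be opened in Excel or LibreOffice Calc, with one row per account.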
