How to use BeautifulSoup 4 to extract Tweet text?

Question

Tweets are found under a class called TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text.

I tried scraping with BeautifulSoup 4 to obtain only the Tweet text, but instead of the text I get the element printed along with its class. How can I get only the Tweet text? I only want the part highlighted in yellow in the screenshot below:

Script

import requests, bs4
from bs4 import BeautifulSoup

link = "https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201"

r = requests.get(link).text


soup = bs4.BeautifulSoup(r, "html")

tweet_text = soup.find("p", {"class": "TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text"})


content = str(tweet_text)


print(content)

Actual output:

<p class="TweetTextSize TweetTextSize--jumbo js-tw

Expected output:

i’m gonna need to take a moment to formulate an opinion and i’ll try to get back to y’all

Just call `tweet_text.text` on the element to get its text. `tweet_text` isn't actually text, it's the element itself. `r` isn't actually a response, either, it's the text content of the response. It's a good idea to name things what they are to avoid confusing yourself. `dir()` and `type()` are your friends (as well as the docs)! Also, I suggest using `"lxml"` as your HTML parser. — ggorlen, Feb 11 '22 at 17:55
Thank you for your help!! I added the `.text` part (to get the text for that element), and I also changed the parser from html to lxml, and now I'm getting closer to where I want to be. However, my output has a weird URL appended at the end that I don't want. Do you know why that is happening? I want the output to look like the one above (expected output), but instead I'm getting `i’m gonna need to take a moment to formulate an opinion and i’ll try to get back to y’allhttps://twitter.com/hartorah/status/1491764170588626948 …` — facialrecognition, Feb 11 '22 at 18:00
That's part of the text because the `<a>` is a child of the `<p>` you're selecting. You can use `"".join(tweet_text.find_all(text=True, recursive=False))` as described [here](https://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children) to only take the shallow text, but this might give you weird output when the link or other child text really is important to the tweet, so you might want to add other logic to account for this depending on your use case. — ggorlen, Feb 11 '22 at 18:06
Thank you so much, @ggorlen! This is very helpful. Does the above `"".join(...)` call work even if a URL is attached to the text without spaces (like in the above example, there's no space between `y'all` and the URL)? — facialrecognition, Feb 11 '22 at 18:10
Not to sound snarky, but why not try it in the interpreter and see? The spaces are irrelevant because the code works on the element tree, not the text. Again -- I think you may have a fundamental misunderstanding between a _HTML tree node_ and _its string text content_. You could also strip the URL from the text after dumping the tree to string, but that seems a bit more brittle to me. — ggorlen, Feb 11 '22 at 18:11
Sorry about that. I just tried it out, and it works brilliantly. I thank you again for your help, it means so much, and I learned something new today. Have a good day! — facialrecognition, Feb 11 '22 at 18:17
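
To make the distinction drawn in these comments concrete, here is a minimal offline sketch. The HTML string, the `example.com` URL, and the variable names are hypothetical stand-ins for the archived page, not its real markup, and `string=True` is used as the newer spelling of the `text=True` argument quoted above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the shape of the tweet paragraph:
# a <p> holding the tweet text plus an <a> child whose text is a URL.
html = (
    '<p class="TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text">'
    "hello world"
    '<a class="twitter-timeline-link u-hidden" href="https://example.com/status/1">'
    "https://example.com/status/1"
    "</a></p>"
)

soup = BeautifulSoup(html, "html.parser")
tweet_element = soup.find("p", class_="tweet-text")

# str() serializes the whole node, tags and attributes included.
as_markup = str(tweet_element)

# .text concatenates every string below the node, including the link text.
all_text = tweet_element.text

# string=True with recursive=False keeps only the strings that are direct
# children of the <p>, so the text inside the <a> is skipped.
shallow_text = "".join(tweet_element.find_all(string=True, recursive=False))

print(type(tweet_element).__name__)  # Tag
print(all_text)      # hello worldhttps://example.com/status/1
print(shallow_text)  # hello world
```

Because the shallow-text approach walks the element tree, it does not matter whether the URL is separated from the preceding text by a space.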

score 2 · Accepted Answer · 2022-02-11T18:14:52.447


Updated

This works fine for me

import requests
from bs4 import BeautifulSoup

link = "https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201"
r = requests.get(link).text
soup = BeautifulSoup(r, "lxml")
tweet_text = soup.find("p", {"class": "TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text"})
# Remove the hidden link element so its URL is not part of the text
dest = soup.find('a', {"class": "twitter-timeline-link u-hidden"})
if dest is not None:
    dest.decompose()
content = tweet_text.text

print(content)

edited Feb 11 '22 at 18:14

answered Feb 11 '22 at 18:04

Thanks, joker 010! However, it prints out `i’m gonna need to take a moment to formulate an opinion and i’ll try to get back to y’allhttps://twitter.com/hartorah/status/1491764170588626948 …` What if I want to only have the text, not the URL appended to its end? Thanks. – facialrecognition Feb 11 '22 at 18:08
@facialrecognition I updated the code and it works fine. All you have to do is decompose the element that contains the link – Feb 11 '22 at 18:15
Thank you so much!!! – facialrecognition Feb 11 '22 at 18:17
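
The `decompose()` step can be tried without a network request. This is only a sketch under stated assumptions: the HTML string and the `example.com` URL are hypothetical, and the link is looked up inside the tweet paragraph rather than in the whole document, which is a small variation on the answer's code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the tweet paragraph with its hidden link.
html = (
    '<p class="TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text">'
    "some tweet text"
    '<a class="twitter-timeline-link u-hidden" href="https://example.com/status/2">'
    "https://example.com/status/2"
    "</a></p>"
)

soup = BeautifulSoup(html, "html.parser")
tweet_element = soup.find("p", class_="tweet-text")

# find() returns None on a miss, so guard before calling decompose().
hidden_link = tweet_element.find("a", class_="twitter-timeline-link")
if hidden_link is not None:
    hidden_link.decompose()  # removes the tag and its contents from the tree

content = tweet_element.text
print(content)  # some tweet text
```

Note that `decompose()` modifies the parsed tree in place, so the link is gone for any later lookups on the same `soup` object.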

J3ldo · Answer 2 · 2022-02-11T18:11:24.690

1

This code works, at least it did for me. I think you got your output because you were using html instead of lxml.

import requests
from bs4 import BeautifulSoup
# lxml must be installed so bs4 can use the "lxml" parser
import lxml

# URL of the tweet
url = 'https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201'

# get the page source
tweet = requests.get(url).text

soup = BeautifulSoup(tweet, "lxml")

# find the tweet element and take its text
tweet_text = soup.find("p", class_='TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text').text
print(tweet_text)

edited Feb 11 '22 at 18:11

answered Feb 11 '22 at 18:08


Thank you, I will let someone mark the question as duplicate. – facialrecognition Feb 11 '22 at 18:11
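
One detail worth knowing about the lookup used in these answers: when a string containing several class names is passed as the class filter, BeautifulSoup compares it with the exact value of the `class` attribute, so the order of the names matters. The sketch below uses hypothetical markup with the same classes in a different order to show the difference; searching by a single class or with a CSS selector is one possible alternative, not something the answers above rely on.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tweet paragraph's classes, listed in another order.
html = '<p class="tweet-text js-tweet-text TweetTextSize--jumbo TweetTextSize">hi</p>'
soup = BeautifulSoup(html, "html.parser")

# A multi-class string must equal the attribute value exactly, so this misses.
exact = soup.find(
    "p", class_="TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text"
)

# A single class name matches any element that carries that class.
by_one_class = soup.find("p", class_="tweet-text")

# A CSS selector matches elements carrying all listed classes, in any order.
by_selector = soup.select_one("p.js-tweet-text.tweet-text")

print(exact)              # None
print(by_one_class.text)  # hi
print(by_selector.text)   # hi
```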

Abhay S · Answer 3 · 2022-02-11T18:33:48.030

1

I was able to get the text from the tweet. I then used a regex to remove the link from a child element that was showing up. Hope it helps!

import re

import requests
from bs4 import BeautifulSoup

link = "https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201"

r = requests.get(link).text

soup = BeautifulSoup(r, "html.parser")

tweet_text = soup.find(
    "p", class_="TweetTextSize--jumbo").text
# Remove the URL contributed by the child link element
content = re.sub(r'http\S+', '', tweet_text)

# Drop non-ASCII characters such as the trailing ellipsis (U+2026).
# Note: this also removes other non-ASCII characters, e.g. curly apostrophes.
strencode = content.encode("ascii", "ignore")
strdecode = strencode.decode()

print(strdecode)

edited Feb 11 '22 at 18:33

answered Feb 11 '22 at 18:10


Thank you, Abhay! Quick question: the text printed has an ellipsis at the end. Is there a way to get rid of it? `i’m gonna need to take a moment to formulate an opinion and i’ll try to get back to y’all …` – facialrecognition Feb 11 '22 at 18:14
yep sorry about that, you just have to add these lines after `content = re.sub('http\S+', '', tweet_text)` and then you print strdecode. that was a "U+2026" unicode character and I was able to remove it here. `strencode = content.encode("ascii", "ignore")` `strdecode = strencode.decode()` `print(strdecode)` – Abhay S Feb 11 '22 at 18:31
i've also edited the main solution that i've posted to include these changes – Abhay S Feb 11 '22 at 18:34
Thank you so much!!! You don't have to answer at all, but if you know a way to append the text with the child element link without the ellipsis, that'd be super awesome. Otherwise, thank you so much, your help meant a lot! – facialrecognition Feb 12 '22 at 03:54
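
Since the ASCII round-trip removes every non-ASCII character, including the curly apostrophes in the tweet, a narrower cleanup may be preferable. The following is a sketch only: the `raw` string is a hypothetical sample shaped like the output quoted in the comments, with an `example.com` URL in place of the real one.

```python
import re

# Hypothetical raw text: tweet text, an attached URL, then " …" (U+2026).
raw = "i’ll try to get back to y’allhttps://example.com/status/1 …"

# Drop the URL first, then a trailing ellipsis and any whitespace around it.
without_url = re.sub(r"https?://\S+", "", raw)
cleaned = re.sub(r"\s*…\s*$", "", without_url).strip()

print(cleaned)  # i’ll try to get back to y’all

# For comparison: the ASCII round-trip also strips the curly apostrophes.
ascii_only = raw.encode("ascii", "ignore").decode()
```

This keeps the apostrophes intact, at the cost of handling only an ellipsis at the very end of the string.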
