Tweets are found under a class called `TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text`.
I tried scraping with BeautifulSoup 4 to obtain only the Tweet text, but instead of the text I get the whole `<p>` element printed, class included. How can I get only the Tweet text? I only want the part highlighted in yellow in the screenshot below:
Script:
import requests
from bs4 import BeautifulSoup

link = "https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201"
r = requests.get(link).text
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r, "html.parser")
tweet_text = soup.find("p", {"class": "TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text"})
content = str(tweet_text)
print(content)
Actual output:
<p class="TweetTextSize TweetTextSize--jumbo js-tw
Expected output:
i’m gonna need to take a moment to formulate an opinion and i’ll try to get back to y’all
` you're selecting. You can use `"".join(tweet_text.find_all(text=True, recursive=False))` as described [here](https://stackoverflow.com/questions/4995116/only-extracting-text-from-this-element-not-its-children) to only take the shallow text, but this might give you weird output when the link or other child text really is important to the tweet, so you might want to add other logic to account for this depending on your use case.
– ggorlen Feb 11 '22 at 18:06
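To make the comment above concrete, here is a minimal sketch comparing `str()`, `get_text()`, and the shallow-text approach. The HTML snippet and its link are made up for illustration (the real archived page may nest different children inside the `<p>`), and `string=True` is used as the current spelling of the older `text=True` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the tweet paragraph; the link
# inside it is an assumption, not copied from the archived page.
SAMPLE_HTML = (
    '<p class="TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text">'
    "i'm gonna need to take a moment "
    '<a href="https://example.com/photo">pic.example.com/photo</a>'
    "</p>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ matches when any one of the element's classes equals the value.
tweet_text = soup.find("p", class_="tweet-text")

# str() serialises the whole element, tag and attributes included.
as_markup = str(tweet_text)

# get_text() returns the text of the element and all of its descendants.
full_text = tweet_text.get_text()

# Shallow text: only the strings that are direct children of the <p>.
shallow_text = "".join(tweet_text.find_all(string=True, recursive=False)).strip()

print(as_markup)
print(full_text)
print(shallow_text)
```

If the child text (a link or a hashtag, say) matters for your use case, `get_text()` is probably the safer default; the shallow version silently drops it.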