0

Tweets are found under a class called TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text.

I tried scraping with BeautifulSoup 4 to obtain only the Tweet text, but instead of the text I get the whole element printed, class included. How can I get only the Tweet text? I only want the part highlighted in yellow in the screenshot (tweet_text).

Script

import requests, bs4
from bs4 import BeautifulSoup

link = "https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201"

r = requests.get(link).text


soup = bs4.BeautifulSoup(r, "html")

tweet_text = soup.find("p", {"class": "TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text"})


content = str(tweet_text)


print(content)

Actual output:

<p class="TweetTextSize TweetTextSize--jumbo js-tw

Expected output:

i’m gonna need to take a moment to formulate an opinion and i’ll try to get back to y’all

3 Answers

2

Updated

This works fine for me

import requests
from bs4 import BeautifulSoup

link = "https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201"
r = requests.get(link).text
soup = BeautifulSoup(r, "lxml")
tweet_text = soup.find("p", {"class": "TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text"})
# Remove the hidden link element so its URL is not appended to the tweet text
dest = soup.find("a", {"class": "twitter-timeline-link u-hidden"})
if dest is not None:
    dest.decompose()
content = tweet_text.text

print(content)
  • Thanks, joker 010! However, it prints out `i’m gonna need to take a moment to formulate an opinion and i’ll try to get back to y’allhttps://twitter.com/hartorah/status/1491764170588626948 …` What if I want to only have the text, not the URL appended to its end? Thanks. – facialrecognition Feb 11 '22 at 18:08
  • @facialrecognition I updated the code and it works fine; all you have to do is decompose the element that contains the link –  Feb 11 '22 at 18:15
  • Thank you so much!!! – facialrecognition Feb 11 '22 at 18:17
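To illustrate the decompose step mentioned in the comments without hitting the network, here is a minimal sketch. The HTML string is a made-up, simplified stand-in for the archived page (the example.com link and the shortened text are assumptions), and it uses the built-in "html.parser" so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the archived tweet markup.
html = (
    '<p class="TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text">'
    "get back to y’all"
    '<a class="twitter-timeline-link u-hidden" href="https://example.com/status/1">'
    "https://example.com/status/1</a>"
    "</p>"
)

soup = BeautifulSoup(html, "html.parser")
# Matching on one of the classes is enough to find the paragraph.
tweet = soup.find("p", class_="tweet-text")

# Remove every hidden link inside the tweet paragraph before reading the text.
for link in tweet.find_all("a", class_="twitter-timeline-link"):
    link.decompose()

clean_text = tweet.get_text().strip()
print(clean_text)
```

Looping over find_all instead of calling decompose on a single find result also avoids an error when the tweet contains no link at all.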
1

This code worked for me, at least. I think you got that output because you were using html instead of lxml.

import requests
from bs4 import BeautifulSoup
# lxml must be installed for the "lxml" parser below; the explicit import is optional
import lxml

#url to the tweet
url = 'https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201'

#get the page source
tweet = requests.get(url).text

soup = BeautifulSoup(tweet, "lxml")

# find the tweet paragraph and keep only its text
tweet_text = soup.find("p", class_='TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text').text
print(tweet_text)
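As a side note, the markup in the question's output most likely comes from calling str() on the tag rather than from the parser choice, since both answers switch to .text. A minimal sketch of the difference, on a made-up snippet (the HTML here is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration only.
html = '<p class="tweet-text">hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("p", class_="tweet-text")

as_markup = str(tag)      # serialises the tag, markup and attributes included
as_text = tag.get_text()  # only the text nodes inside the tag (same as tag.text)

print(as_markup)
print(as_text)
```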
J3ldo
1

I was able to get the text from the tweet; I then used a regex to remove the link from a child element that was showing up. Hope it helps!

import re

import requests
from bs4 import BeautifulSoup

link = "https://web.archive.org/web/20220210162643/https://twitter.com/toteskosh/status/1491809570997555201"

r = requests.get(link).text

soup = BeautifulSoup(r, "html.parser")

# Matching on a single class is enough to find the tweet paragraph
tweet_text = soup.find("p", class_="TweetTextSize--jumbo").text
# Strip the URL contributed by the child link element
content = re.sub(r'http\S+', '', tweet_text)

# Drop non-ASCII characters such as the trailing ellipsis (U+2026)
strencode = content.encode("ascii", "ignore")
strdecode = strencode.decode()

print(strdecode)
Abhay S
  • Thank you, Abhay! Quick question: the text printed has an ellipsis at the end. Is there a way to get rid of it? `i’m gonna need to take a moment to formulate an opinion and i’ll try to get back to y’all …` – facialrecognition Feb 11 '22 at 18:14
  • yep sorry about that, you just have to add these lines after `content = re.sub('http\S+', '', tweet_text)` and then you print strdecode. that was a "U+2026" unicode character and I was able to remove it here. `strencode = content.encode("ascii", "ignore")` `strdecode = strencode.decode()` `print(strdecode)` – Abhay S Feb 11 '22 at 18:31
  • i've also edited the main solution that i've posted to include these changes – Abhay S Feb 11 '22 at 18:34
  • Thank you so much!!! You don't have to answer at all, but if you know a way to append the text with the child element link without the ellipsis, that'd be super awesome. Otherwise, thank you so much, your help meant a lot! – facialrecognition Feb 12 '22 at 03:54
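One possible way to keep the link while dropping only the ellipsis is sketched below. It works on a hard-coded string shaped like the output quoted in the comments (the shortened text and the plain space before the ellipsis are assumptions), so it only needs the standard library. Note that encode("ascii", "ignore") also strips the curly apostrophes (U+2019), whereas replacing U+2026 alone keeps them.

```python
import re

# Hypothetical scraped text, shaped like the output quoted in the comments.
raw = (
    "i’ll try to get back to y’all"
    "https://twitter.com/hartorah/status/1491764170588626948 \u2026"
)

# Pull out the first URL, if any, before stripping it from the text.
match = re.search(r"http\S+", raw)
url = match.group(0) if match else ""

# Drop the URL and the ellipsis character (U+2026) only, then trim whitespace.
text = re.sub(r"http\S+", "", raw).replace("\u2026", "").strip()

# Re-append the link, separated from the tweet text by a space.
combined = f"{text} {url}".strip()
print(combined)
```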