
I am using Goose to read the title and body text of an article from a URL. However, this does not work with a Twitter URL, presumably because of the different HTML tag structure. Is there a way to read the tweet text from such a link?

One such example of a tweet (shortened link) is as follows:

https://twitter.com/UniteAlbertans/status/899468829151043584/photo/1

NOTE: I know how to read tweets through the Twitter API, but I am not interested in that. I just want to get the text by parsing the HTML source, without the Twitter authentication hassle.

utengr

1 Answer


Scrape it yourself

Open the URL of the tweet, pass the response to an HTML parser of your choice, and extract the text with XPath expressions for the elements you are interested in.

Scraping is discussed in: http://docs.python-guide.org/en/latest/scenarios/scrape/

XPaths can be obtained in the browser: right-click the element you want, select "Inspect", then right-click the highlighted line in the Inspector and select "Copy" > "Copy XPath". This works if the structure of the site is always the same. Otherwise, choose properties that identify exactly the element you want.

In your case:

//div[contains(@class, 'permalink-tweet-container')]//strong[contains(@class, 'fullname')]/text()

will get you the name of the author and

//div[contains(@class, 'permalink-tweet-container')]//p[contains(@class, 'tweet-text')]//text()

will get you the content of the Tweet.
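Both expressions can be tried offline before pointing them at the live page. The following is a minimal sketch: the HTML in `SAMPLE` is a hypothetical, simplified stand-in invented for illustration (only the class names come from the XPaths above), not a copy of Twitter's actual markup.

```python
from lxml import html

# Hypothetical, simplified markup. Only the class names are taken from the
# XPath expressions above; everything else is made up for this example.
SAMPLE = """
<html><body>
  <div class="permalink-tweet-container">
    <strong class="fullname show-popup">Example Author</strong>
    <p class="TweetTextSize tweet-text">Hello <a href="#">example.com</a> world</p>
  </div>
  <div class="replies">
    <p class="tweet-text">A reply that should be ignored</p>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# /text() returns only the text nodes that are direct children of <strong>.
author = tree.xpath(
    "//div[contains(@class, 'permalink-tweet-container')]"
    "//strong[contains(@class, 'fullname')]/text()"
)

# //text() also descends into nested elements such as links, so the tweet
# comes back as several fragments in document order.
fragments = tree.xpath(
    "//div[contains(@class, 'permalink-tweet-container')]"
    "//p[contains(@class, 'tweet-text')]//text()"
)

print(author)
print(fragments)
```

Note that `contains(@class, ...)` is a substring test on the whole class attribute, which is why it still matches when an element carries several classes. The reply paragraph is skipped because it is not inside the `permalink-tweet-container` div.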

The full working example:

from lxml import html
import requests

# Download the tweet page and parse the HTML
page = requests.get('https://twitter.com/UniteAlbertans/status/899468829151043584')
tree = html.fromstring(page.content)
# Collect every text node inside the tweet paragraph
print(tree.xpath('//div[contains(@class, "permalink-tweet-container")]//p[contains(@class, "tweet-text")]//text()'))

results in:

['Breaking:\n10 sailors missing, 5 injured after USS John S. McCain collides with merchant vessel near Singapore...\n\n', 'https://www.', 'washingtonpost.com/world/another-', 'us-navy-destroyer-collides-with-a-merchant-ship-rescue-efforts-underway/2017/08/20/c42f15b2-8602-11e7-9ce7-9e175d8953fa_story.html?utm_term=.e3e91fff99ba&wpisrc=al_alert-COMBO-world%252Bnation&wpmk=1', u'\xa0', u'\u2026', 'pic.twitter.com/UiGEZq7Eq6']
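Because `//text()` returns one string per text node, the tweet arrives in pieces, as the list above shows. If a single string is wanted, the pieces can be joined and the whitespace normalised. This is only a sketch: `join_fragments` is a hypothetical helper name, and the sample fragments are made up to mirror the shape of the real result.

```python
def join_fragments(parts):
    """Concatenate text fragments and collapse all whitespace runs.

    In Python 3, str.split() with no arguments splits on any Unicode
    whitespace, which includes newlines and the non-breaking space '\xa0'.
    """
    return ' '.join(''.join(parts).split())


# Made-up fragments with the same shape as the XPath result above.
sample = [
    'Hello world\n\n',
    'https://www.',
    'example.com/some-',
    'page',
    '\xa0',
    '\u2026',
    'pic.twitter.com/abc123',
]

tweet_text = join_fragments(sample)
print(tweet_text)
```

Joining with an empty string first matters here: the link is split across several fragments, and inserting a separator between them would break the URL.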
petrpulc
  • Just to clarify the XPath used: `//` searches anywhere in the document for `div[contains(@class, 'permalink-tweet-container')]`, a div whose class attribute contains 'permalink-tweet-container'; the second `//` searches anywhere below that div for `strong[contains(@class, 'fullname')]`, a strong element whose class attribute contains 'fullname'; `/` then selects, directly from that element, `text()`, its text. – petrpulc Aug 23 '17 at 10:40
  • You can test your own XPath, for example, at http://videlibri.sourceforge.net/cgi-bin/xidelcgi – petrpulc Aug 23 '17 at 11:04
  • I have to try it and will get back to you. – utengr Aug 24 '17 at 11:16