I have tried all of the answers on SO about BeautifulSoup not finding all of the links on a page, but none of them work. I'm doing academic research on Facebook and am trying to scrape the /hashtag/ links from some status posts, since hashtags are not available through the FB Graph API. Here's a sample post: https://www.facebook.com/339278974073/posts/10151731033014074

If I run the following block of code:

import re  # needed for re.compile() in the search further down
import urllib2
from BeautifulSoup import BeautifulSoup
url = 'https://www.facebook.com/339278974073/posts/10151731033014074'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)

and then look at the output of the variable 'soup', I can SEE that there are links with '/hashtag/' in them. A year or so ago, all I had to do was the following to find all instances of hashtags:

hashtag = soup.findAll('a', href=re.compile('/hashtag/?')) 

Now this appears to be broken: BeautifulSoup does not search the block of text that contains the hashtags. They are all inside a <code> element with the class 'hidden_elem', which I can see in the soup, but BS finds nothing in it. Any answers would be appreciated!

Here is the part of the soup where BS is not finding anything (I apologize for the mess):

[<code class="hidden_elem" id="u_0_c"><!--<!-- <div class="_5pcb"><div class="_5jmm  
 _5pat _5uch _5uun" data-ft="&#123;&quot;fbfeed_location&quot;:5&#125;" id="u_0_3"><div
class="clearfix userContentWrapper _5pcr"><a class="_5pb8" 
href="https://www.facebook.com/IndianaOrganProcurementOrganization" data-
ft="&#123;&quot;tn&quot;:&quot;\\u003C&quot;&#125;"><img class="_s0 _5xib _rw img" 
src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/t1.0-
1/p50x50/602346_10151254741684074_596547152_s.jpg" alt=""   /></a>
<div class="_5pax"><h5 class="_5yig _5pbw" data-
ft="&#123;&quot;tn&quot;:&quot;C&quot;&#125;"><div class="fwn fcg">
<span class="fwb fcg" data-ft="&#123;&quot;tn&quot;:&quot;k&quot;&#125;">
<a href="https://www.facebook.com/IndianaOrganProcurementOrganization">Indiana 
Organ Procurement Organization</a></span></div></h5><div class="mbs _5pbx userContent"
 data-ft="&#123;&quot;tn&quot;:&quot;K&quot;&#125;"><p>Stop by our tent and get 
your &#064;jimmybuffet <a class="_58cn"
href="https://www.facebook.com/hashtag/pencilthinmustache?source=feed_text" 
data-ft="&#123;&quot;tn&quot;:&quot;*N&quot;,&quot;type&quot;:104&#125;">
<span class="_58cl">‪#‎</span><span class="_58cm">PencilThinMustache‬</span></a>
 and <a class="_58cn" href="https://www.facebook.com/hashtag/sayyes?source=feed_text" 
 .......
[some code deleted]
<div id="substream_pagelet" data-referrer="substream_pagelet"></div> -->--></code>]

What I would like to get is the text in the /hashtag/ url, such as 'PencilThinMustache', but I'd be happy with just getting the url at this point.

Martijn Pieters
Gregory Saxton
  • Please, do use [BeautifulSoup 4](http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup/22583436#22583436); 3 is way out of date. – Martijn Pieters Mar 23 '14 at 03:14
  • Thanks, I've tried it with BeautifulSoup 4 as well. It still doesn't read it. It won't find any of the URLs in this code block. – Gregory Saxton Mar 23 '14 at 03:18
  • That wasn't a solution, just a general remark that you shouldn't use BS3 anymore; I recognize your page-load code from the question I posted my linked answer to. – Martijn Pieters Mar 23 '14 at 03:24

2 Answers

The problem is that the links with /hashtag/ in their href attribute are inside an HTML comment, and BeautifulSoup treats a comment as a single string rather than parsing it into elements.

One option is to find all comments on a page and search for the links inside:

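import re
from bs4 import BeautifulSoup, Comment

# 'soup' is the BeautifulSoup 4 object built from the page
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
for comment in comments:
    # re-parse the comment body as its own HTML document
    comment_soup = BeautifulSoup(comment)
    links = comment_soup.find_all('a', href=re.compile('/hashtag/?'))
    if links:
        print(links)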

prints:

[<a class="_58cn" data-ft='{"tn":"*N","type":104}' href="https://www.facebook.com/hashtag/pencilthinmustache?source=feed_text"><span class="_58cl">‪#‎</span><span class="_58cm">PencilThinMustache‬</span></a>, <a class="_58cn" data-ft='{"tn":"*N","type":104}' href="https://www.facebook.com/hashtag/sayyes?source=feed_text"><span class="_58cl">‪#‎</span><span class="_58cm">sayyes‬</span></a>, <a class="_58cn" data-ft='{"tn":"*N","type":104}' href="https://www.facebook.com/hashtag/donatelife?source=feed_text"><span class="_58cl">‪#‎</span><span class="_58cm">donatelife‬</span></a>]

Hope that helps.
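The comment-based approach can be tried without a network request. The following is a minimal sketch, assuming Python 3 and BeautifulSoup 4 (the bs4 package) with the built-in html.parser; SAMPLE is a trimmed, made-up stand-in for the 'hidden_elem' block in the question, and hashtag_links is a hypothetical helper name, not part of any library.

```python
import re
from bs4 import BeautifulSoup, Comment

# Trimmed, illustrative stand-in for the commented-out markup in the question.
SAMPLE = (
    '<code class="hidden_elem" id="u_0_c"><!-- <div><p>Stop by our tent '
    '<a class="_58cn" href="https://www.facebook.com/hashtag/pencilthinmustache?source=feed_text">#PencilThinMustache</a> '
    'and <a class="_58cn" href="https://www.facebook.com/hashtag/sayyes?source=feed_text">#sayyes</a>'
    '</p></div> --></code>'
)

HASHTAG_HREF = re.compile(r'/hashtag/')


def hashtag_links(html):
    """Return the href of every /hashtag/ link found inside HTML comments."""
    soup = BeautifulSoup(html, 'html.parser')
    found = []
    # Comments are stored as strings, so select them by type.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Parse the comment body as a separate document to get real tags.
        inner = BeautifulSoup(comment, 'html.parser')
        found.extend(a['href'] for a in inner.find_all('a', href=HASHTAG_HREF))
    return found


# Searching the outer soup directly finds nothing, which matches the question.
direct = BeautifulSoup(SAMPLE, 'html.parser').find_all('a', href=HASHTAG_HREF)
print(direct)                  # []
print(hashtag_links(SAMPLE))   # two hashtag URLs
```

The direct search comes back empty because the anchors only exist as text inside the comment; they become tags once the comment is parsed on its own.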

alecxe

Your <code class="hidden_elem"> tag contains an HTML comment, not elements.

Parse those out as HTML separately:

>>> comment = soup.find('code').contents[0]
>>> type(comment)
<class 'BeautifulSoup.Comment'>
>>> BeautifulSoup(comment).findAll('a', href=re.compile('/hashtag/?'))
[<a class="_58cn" href="https://www.facebook.com/hashtag/pencilthinmustache?source=feed_text" data-ft='{"tn":"*N","type":104}'><span class="_58cl">‪#‎</span><span class="_58cm">PencilThinMustache‬</span></a>, <a class="_58cn" href="https://www.facebook.com/hashtag/sayyes?source=feed_text" data-ft='{"tn":"*N","type":104}'><span class="_58cl">‪#‎</span><span class="_58cm">sayyes‬</span></a>, <a class="_58cn" href="https://www.facebook.com/hashtag/donatelife?source=feed_text" data-ft='{"tn":"*N","type":104}'><span class="_58cl">‪#‎</span><span class="_58cm">donatelife‬</span></a>]
>>> for link in BeautifulSoup(comment).findAll('a', href=re.compile('/hashtag/?')):
...     print link.text
... 
‪#‎PencilThinMustache‬
‪#‎sayyes‬
‪#‎donatelife‬
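The question asks for the bare tag name, such as 'PencilThinMustache'. The link text printed above carries invisible Unicode directional marks around the '#', and the URL carries a query string, so a little cleanup may be needed. This is a minimal sketch using only the Python 3 standard library; hashtag_from_href and clean_link_text are hypothetical helper names introduced for illustration.

```python
import unicodedata
from urllib.parse import urlparse

PREFIX = '/hashtag/'


def hashtag_from_href(href):
    """Return the tag name from a /hashtag/ URL, or None for other URLs."""
    path = urlparse(href).path          # drops the ?source=feed_text part
    if not path.startswith(PREFIX):
        return None
    return path[len(PREFIX):].strip('/') or None


def clean_link_text(text):
    """Drop invisible format characters (category Cf) and the leading '#'."""
    visible = ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')
    return visible.lstrip('#')


href = 'https://www.facebook.com/hashtag/pencilthinmustache?source=feed_text'
print(hashtag_from_href(href))    # pencilthinmustache

# Link text as in the output above, with the hidden marks written as escapes.
raw_text = '\u202a#\u200ePencilThinMustache\u202c'
print(clean_link_text(raw_text))  # PencilThinMustache
```

The URL form is lowercased, while the link text keeps the capitalisation the author typed, so which helper to use depends on whether case matters for the analysis.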
Martijn Pieters