I have tried all of the answers on SO related to BeautifulSoup not finding all of the links on a page, but none of them work. I'm doing some academic research on Facebook and am trying to scrape the /hashtag/ links from status posts, since they are not available through the FB Graph API. Here's a sample post: https://www.facebook.com/339278974073/posts/10151731033014074
If I run the following block of code:
import re
import urllib2
from BeautifulSoup import BeautifulSoup
url = 'https://www.facebook.com/339278974073/posts/10151731033014074'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
and then look at the output of the variable 'soup', I can SEE that there are links with '/hashtag/' in them. A year or so ago, all I had to do was the following to find all instances of hashtags:
hashtag = soup.findAll('a', href=re.compile('/hashtag/?'))
Now this returns nothing. BeautifulSoup does not seem to be reading the block of text that contains the hashtags: they all sit inside a code element with class 'hidden_elem', and the markup in there appears to be wrapped in an HTML comment (<!-- ... -->). I can see it in the soup, but findAll does not find anything inside it. Any answers would be appreciated!
Here is the part of the soup where BS is not finding anything (I apologize for the mess):
[<code class="hidden_elem" id="u_0_c"><!--<!-- <div class="_5pcb">
<div class="_5jmm _5pat _5uch _5uun" data-ft="{"fbfeed_location":5}" id="u_0_3">
<div class="clearfix userContentWrapper _5pcr">
<a class="_5pb8" href="https://www.facebook.com/IndianaOrganProcurementOrganization" data-ft="{"tn":"\\u003C"}">
<img class="_s0 _5xib _rw img" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-prn2/t1.0-1/p50x50/602346_10151254741684074_596547152_s.jpg" alt="" /></a>
<div class="_5pax"><h5 class="_5yig _5pbw" data-ft="{"tn":"C"}"><div class="fwn fcg">
<span class="fwb fcg" data-ft="{"tn":"k"}">
<a href="https://www.facebook.com/IndianaOrganProcurementOrganization">Indiana Organ Procurement Organization</a></span></div></h5>
<div class="mbs _5pbx userContent" data-ft="{"tn":"K"}"><p>Stop by our tent and get your @jimmybuffet
<a class="_58cn" href="https://www.facebook.com/hashtag/pencilthinmustache?source=feed_text" data-ft="{"tn":"*N","type":104}">
<span class="_58cl">#</span><span class="_58cm">PencilThinMustache</span></a>
and <a class="_58cn" href="https://www.facebook.com/hashtag/sayyes?source=feed_text"
.......
[some code deleted]
<div id="substream_pagelet" data-referrer="substream_pagelet"></div> -->--></code>]
What I would like to get is the text of each /hashtag/ link, such as 'PencilThinMustache', but I'd be happy with just getting the URL at this point.
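For reference, here is a minimal sketch of one possible workaround, assuming the hashtag links really are wrapped in an HTML comment as the dump above suggests. It uses only the Python 3 standard library rather than the urllib2 / BeautifulSoup 3 setup above, and the names `HashtagLinkParser`, `find_hashtag_links` and `SAMPLE` are made up for illustration. `SAMPLE` is a trimmed-down fragment of the dump, not the real page. The idea is to strip the comment markers so the hidden markup becomes ordinary HTML, then collect every link whose href contains /hashtag/.

```python
from html.parser import HTMLParser


class HashtagLinkParser(HTMLParser):
    """Collects (href, link text) pairs for <a> tags whose href contains /hashtag/."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the hashtag link currently open, if any
        self._text = []     # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if '/hashtag/' in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open hashtag link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None


def find_hashtag_links(html):
    # Assumption: the interesting markup sits inside <!-- ... -->.
    # Removing the markers is a blunt approach; it also exposes any
    # markup that was commented out on purpose.
    uncommented = html.replace('<!--', '').replace('-->', '')
    parser = HashtagLinkParser()
    parser.feed(uncommented)
    parser.close()
    return parser.links


# Hypothetical, shortened fragment modelled on the dump above.
SAMPLE = (
    '<code class="hidden_elem" id="u_0_c"><!-- <div class="_5pcb">'
    '<a href="https://www.facebook.com/IndianaOrganProcurementOrganization">'
    'Indiana Organ Procurement Organization</a>'
    '<p>Stop by our tent and get your @jimmybuffet '
    '<a class="_58cn" href="https://www.facebook.com/hashtag/pencilthinmustache?source=feed_text">'
    '<span class="_58cl">#</span><span class="_58cm">PencilThinMustache</span></a>'
    '</p></div> --></code>'
)

for href, text in find_hashtag_links(SAMPLE):
    print(href, text)
```

On this fragment the non-hashtag page link is skipped and the hashtag link comes back with both its URL and its visible text. Whether the same approach works on the live page depends on how Facebook serves the markup, which may differ from the dump.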